Abstract

This paper describes universal lossless coding strategies for compressing sources on countably infinite alphabets. 
Classes of memoryless sources defined by an envelope condition on the marginal distribution provide benchmarks 
for coding techniques originating from the theory of universal coding over finite alphabets. We prove general upper- 
bounds on minimax regret and lower-bounds on minimax redundancy for such source classes. The general upper 
bounds emphasize the role of the Normalized Maximum Likelihood codes with respect to minimax regret in the 
infinite alphabet context. Lower bounds are derived by tailoring sharp bounds on the redundancy of Krichevsky- 
Trofimov coders for sources over finite alphabets. Up to logarithmic (resp. constant) factors the bounds are matching 
for source classes defined by algebraically declining (resp. exponentially vanishing) envelopes. Effective and (almost) 
adaptive coding techniques are described for the collection of source classes defined by algebraically vanishing 
envelopes. Those results extend our knowledge concerning universal coding to contexts where the key tools from 
parametric inference are known to fail. 

keywords: NML; countable alphabets; redundancy; adaptive compression; minimax; 

I. Introduction 

This paper is concerned with the problem of universal coding on a countably infinite alphabet $\mathcal{X}$ (say the set of positive integers $\mathbb{N}_+$ or the set of integers $\mathbb{N}$) as described for example by Orlitsky and Santhanam (2004).
Throughout this paper, a source on the countable alphabet $\mathcal{X}$ is a probability distribution on the set $\mathcal{X}^{\mathbb{N}}$ of infinite sequences of symbols from $\mathcal{X}$ (this set is endowed with the $\sigma$-algebra generated by sets of the form $\prod_{i=1}^{n} \{x_i\} \times \mathcal{X}^{\mathbb{N}}$ where all $x_i \in \mathcal{X}$ and $n \in \mathbb{N}$). The symbol $\Lambda$ will be used to denote various classes of sources on the countably infinite alphabet $\mathcal{X}$. The sequence of symbols emitted by a source is denoted by the $\mathcal{X}^{\mathbb{N}}$-valued random variable $X = (X_n)_{n \in \mathbb{N}}$. If $P$ denotes the distribution of $X$, $P^n$ denotes the distribution of $X_{1:n} = X_1, \ldots, X_n$, and we let $\Lambda^n = \{P^n : P \in \Lambda\}$. For any countable set $\mathcal{X}$, let $\mathfrak{M}_1(\mathcal{X})$ be the set of all probability measures on $\mathcal{X}$.



From Shannon's noiseless coding theorem (see Cover and Thomas, 1991), the binary entropy of $P^n$, $H(X_{1:n}) = \mathbb{E}_{P^n}[-\log P(X_{1:n})]$, provides a tight lower bound on the expected number of binary symbols needed to encode outcomes of $P^n$. Throughout the paper, logarithms are in base 2. In the following, we shall only consider finite entropy sources on countable alphabets, and we implicitly assume that $H(X_{1:n}) < \infty$. The expected redundancy of any distribution $Q^n \in \mathfrak{M}_1(\mathcal{X}^n)$, defined as the difference between the expected code length $\mathbb{E}_P[-\log Q^n(X_{1:n})]$ and $H(X_{1:n})$, is equal to the Kullback-Leibler divergence (or relative entropy)

$$D(P^n, Q^n) = \sum_{x \in \mathcal{X}^n} P^n\{x\} \log \frac{P^n(x)}{Q^n(x)} = \mathbb{E}_{P^n}\left[\log \frac{P^n(X_{1:n})}{Q^n(X_{1:n})}\right].$$
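The identity between expected redundancy and relative entropy can be checked numerically on a toy example. The sketch below is only an illustration: the source `p` and the coding distribution `q` are hypothetical four-symbol distributions chosen for convenience, and code lengths are the ideal lengths $-\log_2 Q(x)$.

```python
import math

def expected_code_length(p, q):
    """Expected ideal code length E_P[-log2 Q(X)], in bits."""
    return sum(px * -math.log2(q[x]) for x, px in p.items() if px > 0)

def entropy(p):
    """Shannon entropy of p, in bits."""
    return sum(px * -math.log2(px) for px in p.values() if px > 0)

def kl_divergence(p, q):
    """Relative entropy D(P, Q), in bits."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

# Hypothetical source P and mismatched coding distribution Q (assumptions).
p = {1: 0.5, 2: 0.25, 3: 0.125, 4: 0.125}
q = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}

# Redundancy = expected code length under Q minus the entropy of P.
redundancy = expected_code_length(p, q) - entropy(p)
```

Here `redundancy` and `kl_divergence(p, q)` agree, as the displayed identity states.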



Universal coding attempts to develop sequences of coding probabilities $(Q^n)_n$ so as to minimize expected redundancy over a whole class of sources. Technically speaking, several distinct notions of universality have been considered in the literature. A positive, real-valued function $\rho(n)$ is said to be a strong (respectively weak) universal redundancy rate for a class of sources $\Lambda$ if there exists a sequence of coding probabilities $(Q^n)_n$ such that for all $n$, $R^+(Q^n, \Lambda^n) = \sup_{P \in \Lambda} D(P^n, Q^n) \leq \rho(n)$ (respectively for all $P \in \Lambda$, there exists a constant $C(P)$ such that for all $n$, $D(P^n, Q^n) \leq C(P)\rho(n)$). A redundancy rate $\rho(n)$ is said to be non-trivial if $\lim_n \frac{1}{n}\rho(n) = 0$. Finally a class $\Lambda$ of sources will be said to be feebly universal if there exists a single sequence of coding probabilities $(Q^n)_n$ such that $\sup_{P \in \Lambda} \lim_n \frac{1}{n} D(P^n, Q^n) = 0$. (Note that this notion of feeble universality is usually called weak universality (see Kieffer, 1978; Györfi et al., 1994); we deviate from the tradition in order to avoid confusion with the notion of weak universal redundancy rate.)

The maximal redundancy of $Q^n$ with respect to $\Lambda$ is defined by:

$$R^+(Q^n, \Lambda^n) = \sup_{P \in \Lambda} D(P^n, Q^n).$$

The infimum of $R^+(Q^n, \Lambda^n)$ is called the minimax redundancy with respect to $\Lambda$:

$$R^+(\Lambda^n) = \inf_{Q^n \in \mathfrak{M}_1(\mathcal{X}^n)} R^+(Q^n, \Lambda^n).$$

It is the smallest strong universal redundancy rate for $\Lambda$. When finite, it is often called the information radius of $\Lambda^n$.
As far as finite alphabets are concerned, it is well-known that the class of stationary ergodic sources is feebly universal. This is witnessed by the performance of Lempel-Ziv codes (see Cover and Thomas, 1991). It is also known that the class of stationary ergodic sources over a finite alphabet does not admit any non-trivial weak universal redundancy rate (Shields, 1993). On the other hand, fairly large classes of sources admitting strong universal redundancy rates and non-trivial weak universal redundancy rates have been exhibited (see Barron et al., 1998; Catoni, 2004 and references therein). In this paper, we will mostly focus on strong universal redundancy rates for classes of sources over infinite alphabets. Note that in the latter setting, even feeble universality should not be taken for granted: the class of memoryless processes on $\mathbb{N}_+$ is not feebly universal.



Kieffer (1978) characterized feebly universal classes, and the argument was simplified by Györfi et al. (1993); Györfi et al. (1994). Recall that the entropy rate $H(P)$ of a stationary source is defined as $\lim_n H(P^n)/n$. This result may be phrased in the following way.

Proposition 1: A class $\Lambda$ of stationary sources over a countable alphabet $\mathcal{X}$ is feebly universal if and only if there exists a probability distribution $Q \in \mathfrak{M}_1(\mathcal{X})$ such that for every $P \in \Lambda$ with finite entropy rate, $Q$ satisfies $\mathbb{E}_P \log \frac{1}{Q(X_1)} < \infty$ or equivalently $D(P^1, Q) < \infty$.

Assume that $\Lambda$ is parameterized by $\Theta$ and that $\Theta$ can be equipped with (prior) probability distributions $W$ in such a way that $\theta \mapsto P_\theta^n\{A\}$ is a random variable (a measurable mapping) for every $A \subseteq \mathcal{X}^n$. A convenient way to derive lower bounds on $R^+(\Lambda^n)$ consists in using the relation $\mathbb{E}_W[D(P_\theta^n, Q^n)] \leq R^+(Q^n, \Lambda^n)$.

The sharpest lower bound is obtained by optimizing the prior probability distributions; it is called the maximin bound

$$\sup_{W \in \mathfrak{M}_1(\Theta)} \inf_{Q^n \in \mathfrak{M}_1(\mathcal{X}^n)} \mathbb{E}_W[D(P_\theta^n, Q^n)].$$

It has been proved in a series of papers (Gallager, 1968; Davisson, 1973; Haussler, 1997) (and could also have been derived from a general minimax theorem by Sion, 1958) that such a lower bound is tight.



Theorem 1: Let $\Lambda$ denote a class of sources over some finite or countably infinite alphabet. For each $n$, the minimax redundancy over $\Lambda$ coincides with

$$R^+(\Lambda^n) = \sup_{\Theta, W \in \mathfrak{M}_1(\Theta)} \inf_{Q^n \in \mathfrak{M}_1(\mathcal{X}^n)} \mathbb{E}_W[D(P_\theta^n, Q^n)],$$

where $\Theta$ runs over all parameterizations of countable subsets of $\Lambda$.

If the set $\Lambda^n = \{P^n : P \in \Lambda\}$ is not pre-compact with respect to the topology of weak convergence, then both sides are infinite. A thorough account of topological issues on sets of probability measures can be found in (Dudley, 2002). For the purpose of this paper, it is enough to recall that: first, a subset of a metric space is pre-compact if for any $\epsilon > 0$, it can be covered by a finite number of open balls with radius at most $\epsilon$; second, a sequence $(Q_n)_n$ of probability distributions converges with respect to the topology of weak convergence toward the probability distribution $Q$ if and only if for any bounded continuous function $h$ over the support of the $Q_n$'s, $\mathbb{E}_{Q_n} h \to \mathbb{E}_Q h$. This topology can be metrized using the Lévy-Prokhorov distance.

Otherwise the maximin and minimax average redundancies are finite and coincide; moreover, the minimax redundancy is achieved by the mixture coding distribution $Q^n(\cdot) = \int_\Theta P_\theta^n(\cdot) W(d\theta)$ where $W$ is the least favorable prior.

Another approach to universal coding considers individual sequences (see Feder et al., 1992; Cesa-Bianchi and Lugosi, 2006 and references therein). Let the regret of a coding distribution $Q^n$ on string $x \in \mathbb{N}_+^n$ with respect to $\Lambda$ be $\sup_{P \in \Lambda} \log P^n(x)/Q^n(x)$. Taking the maximum with respect to $x \in \mathbb{N}_+^n$, and then optimizing over the choice of $Q^n$, we get the minimax regret:

$$R^*(\Lambda^n) = \inf_{Q^n \in \mathfrak{M}_1(\mathcal{X}^n)} \max_{x \in \mathbb{N}_+^n} \sup_{P \in \Lambda} \log \frac{P^n(x)}{Q^n(x)}.$$

In order to provide proper insight, let us recall the precise asymptotic bounds on minimax redundancy and regret for memoryless sources over finite alphabets (see Clarke and Barron, 1990; 1994; Barron et al., 1998; Xie and Barron, 1997; 2000; Orlitsky and Santhanam, 2004; Catoni, 2004; Szpankowski, 1998; Drmota and Szpankowski, 2004, and references therein).



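Theorem 2: Let $\mathcal{X}$ be an alphabet of $m$ symbols, and $\Lambda$ denote the class of memoryless processes on $\mathcal{X}$, then

$$\lim_n \left\{ R^+(\Lambda^n) - \frac{m-1}{2} \log \frac{n}{2\pi e} \right\} = \log \frac{\Gamma(1/2)^m}{\Gamma(m/2)},$$

$$\lim_n \left\{ R^*(\Lambda^n) - \frac{m-1}{2} \log \frac{n}{2\pi} \right\} = \log \frac{\Gamma(1/2)^m}{\Gamma(m/2)}.$$

For all $n \geq 2$:

$$R^*(\Lambda^n) \leq \frac{m-1}{2} \log n + 2.$$

The last inequality is checked in the Appendix.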

Remark 1: The phenomenon pointed out in Theorem 2 holds not only for the class of memoryless sources over a finite alphabet but also for classes of sources that are smoothly parameterized by finite dimensional sets (see again Clarke and Barron, 1990; 1994; Barron et al., 1998; Xie and Barron, 1997; 2000; Orlitsky and Santhanam, 2004; Catoni, 2004).



The minimax regret deserves further attention. For a source class $\Lambda$, for every $x \in \mathcal{X}^n$, let the maximum likelihood $\hat{p}(x)$ be defined as $\sup_{P \in \Lambda} P^n(x)$. If $\sum_{x \in \mathbb{N}_+^n} \hat{p}(x) < \infty$, the Normalized Maximum Likelihood coding probability is well-defined and given by

$$Q^n_{\mathrm{NML}}(x) = \frac{\hat{p}(x)}{\sum_{x \in \mathbb{N}_+^n} \hat{p}(x)}.$$

Shtarkov (1987) showed that the Normalized Maximum Likelihood coding probability achieves the same regret over all strings of length $n$ and that this regret coincides with the minimax regret:

$$R^*(\Lambda^n) = \log \sum_{x \in \mathbb{N}_+^n} \hat{p}(x).$$
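To make Shtarkov's observation concrete, the following sketch builds the NML coding probability by brute force for a deliberately small case. It assumes the class of memoryless sources over the binary alphabet $\{0, 1\}$ (a finite special case, not one of the infinite-alphabet classes studied here) and a short length `n = 8`; the function names are illustrative only.

```python
import itertools
import math

def max_likelihood(x):
    """sup over Bernoulli(theta) of P^n(x), attained at theta = (#ones)/n."""
    n, k = len(x), sum(x)
    if k == 0 or k == n:
        return 1.0
    t = k / n
    return t ** k * (1 - t) ** (n - k)

def nml_distribution(n):
    """Enumerate all binary strings of length n and normalize the maximum likelihood."""
    strings = list(itertools.product((0, 1), repeat=n))
    p_hat = {x: max_likelihood(x) for x in strings}
    shtarkov_sum = sum(p_hat.values())
    q_nml = {x: p_hat[x] / shtarkov_sum for x in strings}
    return p_hat, q_nml, shtarkov_sum

n = 8
p_hat, q_nml, shtarkov_sum = nml_distribution(n)
# Regret of NML on each individual string, in bits.
regrets = [math.log2(p_hat[x] / q_nml[x]) for x in p_hat]
minimax_regret = math.log2(shtarkov_sum)
```

Every entry of `regrets` equals `minimax_regret`, which is the equalizer property behind the minimax optimality of NML. Brute-force enumeration is exponential in `n` and is used here only for transparency.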

Memoryless sources over finite alphabets are special cases of envelope classes. The latter will be of primary 
interest. 

Definition 1: Let $f$ be a mapping from $\mathbb{N}_+$ to $[0, 1]$. The envelope class $\Lambda_f$ defined by function $f$ is the collection of stationary memoryless sources with first marginal distribution dominated by $f$:

$$\Lambda_f = \left\{P : \forall x \in \mathbb{N}_+,\ P^1\{x\} \leq f(x), \text{ and } P \text{ is stationary and memoryless}\right\}.$$

We will be concerned with the following topics. 

1) Understanding general structural properties of minimax redundancy and minimax regret. 

2) Characterizing those source classes that have finite minimax regret. 

3) Quantitative relations between minimax redundancy or regret and integrability of the envelope function. 

4) Developing effective coding techniques for source classes with known non-trivial minimax redundancy rate. 

5) Developing adaptive coding schemes for collections of source classes that are too large to enjoy even a weak 
redundancy rate. 

The paper is organized as follows. Section II describes some structural properties of minimax redundancies and regrets for classes of stationary memoryless sources. Those properties include monotonicity and sub-additivity. Proposition 5 characterizes those source classes that admit finite regret. This characterization emphasizes the role of Shtarkov Normalized Maximum Likelihood coding probability. Proposition 6 describes a simple source class for which the minimax regret is infinite, while the minimax redundancy is finite. Finally Theorem 3 asserts that such a contrast is not possible for the so-called envelope classes.

In Section III, Theorems 4 and 5 provide quantitative relations between the summability properties of the envelope function and minimax regrets and redundancies. Those results build on the non-asymptotic bounds on minimax redundancy derived by Xie and Barron (1997).



Section IV focuses on two kinds of envelope classes. This section serves as a benchmark for the two main results from the preceding section. In Subsection IV-A, lower-bounds on minimax redundancy and upper-bounds on minimax regret for classes defined by envelope function $k \mapsto 1 \wedge Ck^{-\alpha}$ are described. Up to a factor $\log n$ those bounds are matching. In Subsection IV-B, lower-bounds on minimax redundancy and upper-bounds on minimax regret for classes defined by envelope function $k \mapsto 1 \wedge C\exp(-\alpha k)$ are described. Up to a multiplicative constant, those bounds coincide and grow like $\log^2 n$.

In Sections V and VI we turn to effective coding techniques geared toward source classes defined by power-law
envelopes. In Section V, we elaborate on the ideas embodied in Proposition 4 from Section II and combine mixture coding and Elias penultimate code (Elias, 1975) to match the upper-bounds on minimax redundancy described in Section IV. One of the messages from Section IV is that the union of envelope classes defined by power laws
does not admit a weak redundancy rate that grows at a rate slower than $n^{1/\beta}$ for any $\beta > 1$. In Section VI, we
finally develop an adaptive coding scheme for the union of envelope classes defined by power laws. This adaptive 
coding scheme combines the censoring coding technique developed in the preceding subsection and an estimation 
of tail-heaviness. It shows that the union of envelope classes defined by power laws is feebly universal. 



II. Structural properties of the minimax redundancy and minimax regret 



Propositions 2, 3 and 4 below are sanity-check statements: they state that when minimax redundancies and regrets are finite, as functions of word-length, they are non-decreasing and sub-additive. In order to prove them, we start with the following proposition which emphasizes the role of the NML coder with respect to the minimax regret. At best, it is a comment on Shtarkov's original work (Shtarkov, 1987; Haussler and Opper, 1997).

Proposition 2: Let $\Lambda$ be a class of stationary memoryless sources over a countably infinite alphabet. The minimax regret with respect to $\Lambda^n$, $R^*(\Lambda^n)$, is finite if and only if the normalized maximum likelihood (Shtarkov) coding probability $Q^n_{\mathrm{NML}}$ is well-defined and given by

$$Q^n_{\mathrm{NML}}(x) = \frac{\hat{p}(x)}{\sum_{y \in \mathcal{X}^n} \hat{p}(y)} \quad \text{for } x \in \mathcal{X}^n,$$

where $\hat{p}(x) = \sup_{P \in \Lambda} P^n(x)$.

Note that the definition of $Q^n_{\mathrm{NML}}$ does not assume either that the maximum likelihood is achieved on $\Lambda$ or that it is uniquely defined.

Proof: The fact that if $Q^n_{\mathrm{NML}}$ is well-defined, the minimax regret is finite and equal to

$$\log \Big( \sum_{y \in \mathcal{X}^n} \hat{p}(y) \Big)$$

is the fundamental observation of Shtarkov (1987).

On the other hand, if $R^*(\Lambda^n) < \infty$, there exists a probability distribution $Q^n$ on $\mathcal{X}^n$ and a finite number $r$ such that for all $x \in \mathcal{X}^n$,

$$\hat{p}(x) \leq r \times Q^n(x);$$

summing over $x$ gives

$$\sum_{x \in \mathcal{X}^n} \hat{p}(x) \leq r < \infty.$$

■

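Proposition 3: Let $\Lambda$ denote a class of sources, then the minimax redundancy $R^+(\Lambda^n)$ and the minimax regret $R^*(\Lambda^n)$ are non-decreasing functions of $n$.

Proof: As far as $R^+$ is concerned, by Theorem 1, it is enough to check that the maximin (mutual information) lower bound is non-decreasing.

For any prior distribution $W$ on a parameter set $\Theta$ (recall that $\{P_\theta : \theta \in \Theta\} \subseteq \Lambda$, and that the mixture coding probability $Q^n$ is defined by $Q^n(A) = \mathbb{E}_W[P_\theta^n(A)]$)

$$\mathbb{E}_W[D(P_\theta^{n+1}, Q^{n+1})] = I(\theta; X_{1:n+1}) = I(\theta; (X_{1:n}, X_{n+1})) \geq I(\theta; X_{1:n}) = \mathbb{E}_W[D(P_\theta^n, Q^n)].$$

Let us now consider the minimax regret. It is enough to consider the case where $R^*(\Lambda^n)$ is finite. Thus we may rely on Proposition 2. Let $n$ be a positive integer. Let $\epsilon$ be a small positive real. For any string $x \in \mathcal{X}^n$, let $P_x \in \Lambda$ be such that $P_x\{x\} \geq \hat{p}(x)(1 - \epsilon)$. Then, for any $x' \in \mathcal{X}$,

$$\hat{p}(xx') \geq P_x(x) \times P_x(x' \mid x) \geq \hat{p}(x)(1 - \epsilon) \times P_x(x' \mid x).$$

Summing over all possible $x' \in \mathcal{X}$ we get

$$\sum_{x'} \hat{p}(xx') \geq \hat{p}(x)(1 - \epsilon).$$

Summing now over all $x \in \mathcal{X}^n$ and $x' \in \mathcal{X}$,

$$\sum_{x \in \mathcal{X}^n, x' \in \mathcal{X}} \hat{p}(xx') \geq (1 - \epsilon) \sum_{x \in \mathcal{X}^n} \hat{p}(x).$$

So that by letting $\epsilon$ tend to 0,

$$\sum_{x \in \mathcal{X}^{n+1}} \hat{p}(x) \geq \sum_{x \in \mathcal{X}^n} \hat{p}(x).$$

■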

Note that the proposition holds even if $\Lambda$ is not a collection of memoryless sources. This Proposition can be easily completed when dealing with memoryless sources.

Proposition 4: If $\Lambda$ is a class of stationary memoryless sources, then the functions $n \mapsto R^+(\Lambda^n)$ and $n \mapsto R^*(\Lambda^n)$ are either infinite or sub-additive.

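Proof: Assume that $R^+(\Lambda^n) < \infty$. Here again, given Theorem 1, in order to establish sub-additivity for $R^+$, it is enough to check the property for the maximin lower bound. Let $n, m$ be two positive integers, and $W$ be any prior on $\Theta$ (with $\{P_\theta : \theta \in \Theta\} \subseteq \Lambda$). As sources from $\Lambda$ are memoryless, $X_{1:n}$ and $X_{n+1:n+m}$ are independent conditionally on $\theta$ and thus

$$\begin{aligned}
I(X_{n+1:n+m}; \theta \mid X_{1:n}) &= H(X_{n+1:n+m} \mid X_{1:n}) - H(X_{n+1:n+m} \mid X_{1:n}, \theta) \\
&= H(X_{n+1:n+m} \mid X_{1:n}) - H(X_{n+1:n+m} \mid \theta) \\
&\leq H(X_{n+1:n+m}) - H(X_{n+1:n+m} \mid \theta) \\
&= I(X_{n+1:n+m}; \theta).
\end{aligned}$$

Hence, using the fact that under each $P_\theta$, the process $(X_n)_{n \in \mathbb{N}_+}$ is stationary:

$$\begin{aligned}
I(X_{1:n+m}; \theta) &= I(X_{1:n}; \theta) + I(X_{n+1:n+m}; \theta \mid X_{1:n}) \\
&\leq I(X_{1:n}; \theta) + I(X_{n+1:n+m}; \theta) \\
&= I(X_{1:n}; \theta) + I(X_{1:m}; \theta).
\end{aligned}$$

Let us now check the sub-additivity of the minimax regret. Suppose that $R^*(\Lambda^1)$ is finite. For any $\epsilon > 0$, for $x \in \mathcal{X}^{n+m}$, let $P \in \Lambda$ be such that $(1 - \epsilon)\hat{p}(x) \leq P^{n+m}(x)$. As for $x \in \mathcal{X}^n$ and $x' \in \mathcal{X}^m$, $P^{n+m}(xx') = P^n(x) \times P^m(x')$, we have for any $\epsilon > 0$, and any $x \in \mathcal{X}^n$, $x' \in \mathcal{X}^m$,

$$(1 - \epsilon)\hat{p}(xx') \leq \hat{p}(x) \times \hat{p}(x').$$

Hence, letting $\epsilon$ tend to 0, and summing over all strings in $\mathcal{X}^{n+m}$:

$$\begin{aligned}
R^*(\Lambda^{n+m}) &= \log \sum_{x \in \mathcal{X}^n} \sum_{x' \in \mathcal{X}^m} \hat{p}(xx') \\
&\leq \log \sum_{x \in \mathcal{X}^n} \hat{p}(x) + \log \sum_{x' \in \mathcal{X}^m} \hat{p}(x') \\
&= R^*(\Lambda^n) + R^*(\Lambda^m).
\end{aligned}$$

■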
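Monotonicity and sub-additivity of the minimax regret can be observed numerically on a simple class. The sketch below assumes the binary memoryless (Bernoulli) class, for which the Shtarkov sum can be computed from counts of ones; this is an illustrative finite-alphabet case, not one of the envelope classes of the paper.

```python
import math

def shtarkov_regret(n):
    """log2 of the sum, over binary strings of length n, of the Bernoulli maximum likelihood."""
    total = 0.0
    for k in range(n + 1):
        if k == 0 or k == n:
            ml = 1.0
        else:
            t = k / n
            ml = t ** k * (1 - t) ** (n - k)
        # math.comb(n, k) strings share the same count k, hence the same maximum likelihood.
        total += math.comb(n, k) * ml
    return math.log2(total)

# Minimax regret R*(Lambda^n) for n = 1, ..., 40, in bits.
regret = {n: shtarkov_regret(n) for n in range(1, 41)}
```

On this table one can check that `regret[n]` is non-decreasing in `n` and that `regret[n + m] <= regret[n] + regret[m]`, in line with Propositions 3 and 4.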



Remark 2: Counter-examples witness the fact that subadditivity of redundancies does not hold in full generality.

The Fekete Lemma (see Dembo and Zeitouni, 1998) leads to:

Corollary 1: Let $\Lambda$ denote a class of stationary memoryless sources over a countable alphabet. For both minimax redundancy $R^+$ and minimax regret $R^*$,

$$\lim_{n \to \infty} \frac{R^+(\Lambda^n)}{n} = \inf_{n \in \mathbb{N}_+} \frac{R^+(\Lambda^n)}{n} \leq R^+(\Lambda^1),$$

and

$$\lim_{n \to \infty} \frac{R^*(\Lambda^n)}{n} = \inf_{n \in \mathbb{N}_+} \frac{R^*(\Lambda^n)}{n} \leq R^*(\Lambda^1).$$

Hence, in order to prove that $R^+(\Lambda^n) < \infty$ (respectively $R^*(\Lambda^n) < \infty$), it is enough to check that $R^+(\Lambda^1) < \infty$ (respectively $R^*(\Lambda^1) < \infty$).

The following Proposition combines Propositions 2, 3 and 4. It can be rephrased as follows: a class of memoryless sources admits a non-trivial strong minimax regret if and only if Shtarkov NML coding probability is well-defined for $n = 1$.

Proposition 5: Let $\Lambda$ be a class of stationary memoryless sources over a countably infinite alphabet. Let $\hat{p}$ be defined by $\hat{p}(x) = \sup_{P \in \Lambda} P\{x\}$. The minimax regret with respect to $\Lambda^n$ is finite if and only if the normalized maximum likelihood (Shtarkov) coding probability is well-defined, and:

$$R^*(\Lambda^n) < \infty \Leftrightarrow \sum_{x \in \mathbb{N}_+} \hat{p}(x) < \infty.$$

Proof: The direct part follows from Proposition 2.
For the converse part, if $\sum_{x \in \mathbb{N}_+} \hat{p}(x) = \infty$, then $R^*(\Lambda^1) = \infty$ and from Proposition 3, $R^*(\Lambda^n) = \infty$ for every positive integer $n$. ■
When dealing with smoothly parameterized classes of sources over finite alphabets (see Barron et al., 1998; Xie and Barron, 2000) or even with the massive classes defined by renewal sources (Csiszár and Shields, 1996), the minimax regret and minimax redundancy are usually of the same order of magnitude (see Theorem 2 and comments in the Introduction). This cannot be taken for granted when dealing with classes of stationary memoryless sources over a countable alphabet.

Proposition 6: Let $f$ be a positive, strictly decreasing function defined on $\mathbb{N}$ such that $f(1) < 1$. For $k \in \mathbb{N}$, let $p_k$ be the probability mass function on $\mathbb{N}$ defined by:

$$p_k(l) = \begin{cases} 1 - f(k) & \text{if } l = 0; \\ f(k) & \text{if } l = k; \\ 0 & \text{otherwise.} \end{cases}$$

Let $\Lambda^1 = \{p_1, p_2, \ldots\}$, let $\Lambda$ be the class of stationary memoryless sources with first marginal in $\Lambda^1$. The finiteness of the minimax redundancy with respect to $\Lambda^n$ depends on the limiting behavior of $f(k) \log k$: for every positive integer $n$:

$$f(k) \log k \to_{k \to \infty} \infty \Leftrightarrow R^+(\Lambda^n) = \infty.$$

Remark 3: When $f(k) = \frac{1}{\log k}$, the minimax redundancy $R^+(\Lambda^n)$ is finite for all $n$. Note, however, that this does not warrant the existence of a non-trivial strong universal redundancy rate. Moreover, as $\sum_k f(k) = \infty$, minimax regret is infinite by Proposition 5.



A similar result appears in the discussion of Theorem 3 in (Haussler and Opper, 1997) where classes with finite minimax redundancy and infinite minimax regret are called irregular.

We will be able to refine those observations after the statement of Corollary 2.

Proof: Let us first prove the direct part. Assume that $f(k) \log k \to_{k \to \infty} \infty$. In order to check that $R^+(\Lambda^1) = \infty$, we resort to the mutual information lower bound (Theorem 1) and describe an appropriate collection of Bayesian games.

Let $m$ be a positive integer and let $\theta$ be uniformly distributed over $\{1, 2, \ldots, m\}$. Let $X$ be distributed according to $p_k$ conditionally on $\theta = k$. Let $Z$ be the random variable equal to 1 if $X = \theta$ and equal to 0 otherwise. Obviously, $H(\theta \mid X, Z = 1) = 0$; moreover, as $f$ is assumed to be non-increasing, $P(Z = 0 \mid \theta = k) = 1 - f(k) \leq 1 - f(m)$ and thus:

$$\begin{aligned}
H(\theta \mid X) &= H(Z \mid X) + H(\theta \mid Z, X) \\
&\leq 1 + P(Z = 0) H(\theta \mid X, Z = 0) + P(Z = 1) H(\theta \mid X, Z = 1) \\
&\leq 1 + (1 - f(m)) \log m.
\end{aligned}$$

Hence,

$$\begin{aligned}
R^+(\Lambda^1) &\geq I(\theta; X) \\
&\geq \log m - 1 - (1 - f(m)) \log m \\
&= f(m) \log m - 1
\end{aligned}$$

which grows to infinity with $m$, so that as announced $R^+(\Lambda^1) = \infty$.
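The Bayesian game used in the direct part can be simulated exactly. The sketch below is an illustration under assumptions: it takes the hypothetical choice $f(k) = 1/\sqrt{\log_2(k+2)}$ (so that $f(k)\log k \to \infty$ and $f(1) < 1$), places a uniform prior on $\{1, \ldots, m\}$, and computes $I(\theta; X) = H(X) - H(X \mid \theta)$ in closed form.

```python
import math

def f(k):
    # Hypothetical choice (an assumption): f(k) * log2(k) tends to infinity, f(1) < 1.
    return 1.0 / math.sqrt(math.log2(k + 2))

def h(probabilities):
    """Shannon entropy in bits of a finite probability vector."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def mutual_information(m):
    """I(theta; X) for theta uniform on {1, ..., m} and X distributed as p_theta."""
    fs = [f(k) for k in range(1, m + 1)]
    # Marginal of X: mass 1 - mean(f) on symbol 0, mass f(k)/m on symbol k.
    marginal = [1.0 - sum(fs) / m] + [fk / m for fk in fs]
    # Given theta = k, X is binary with probabilities (1 - f(k), f(k)).
    conditional = sum(h([1.0 - fk, fk]) for fk in fs) / m
    return h(marginal) - conditional

lower_bounds = {m: f(m) * math.log2(m) - 1.0 for m in (10, 1000, 100000)}
informations = {m: mutual_information(m) for m in lower_bounds}
```

For each `m`, `informations[m]` is at least `lower_bounds[m]`, which is the quantity that the proof shows to be unbounded.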

Let us now prove the converse part. Assume that the sequence $(f(k) \log k)_{k \in \mathbb{N}_+}$ is upper-bounded by some constant $C$. In order to check that $R^+(\Lambda^n) < \infty$ for all $n$, by Proposition 4, it is enough to check that $R^+(\Lambda^1) < \infty$, and thus, it is enough to exhibit a probability distribution $Q$ over $\mathcal{X} = \mathbb{N}$ such that $\sup_{P \in \Lambda^1} D(P, Q) < \infty$.

Let $Q$ be defined by $Q(k) = A / (1 \vee (k (\log k)^2))$ for $k \geq 2$, and $Q(0), Q(1) > 0$, where $A$ is a normalizing constant that ensures that $Q$ is a probability distribution over $\mathcal{X}$.

Then for any $k \geq 3$ (which warrants $k(\log k)^2 > 1$), letting $P_k$ be the probability defined by the probability mass function $p_k$:

$$\begin{aligned}
D(P_k, Q) &= (1 - f(k)) \log \frac{1 - f(k)}{Q(0)} + f(k) \log \frac{f(k)\, k (\log k)^2}{A} \\
&\leq -\log Q(0) + C + f(k) \left(2 \log^{(2)}(k) - \log(A)\right) \\
&\leq C + \log \frac{C^2}{A\, Q(0)}.
\end{aligned}$$

This is enough to conclude that

$$R^+(\Lambda^1) \leq \left(C + \log \frac{C^2}{A\, Q(0)}\right) \vee D(P_1, Q) \vee D(P_2, Q) < \infty.$$
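The boundedness of $D(P_k, Q)$ in the converse part can be illustrated numerically. The sketch below makes several assumptions that are not in the text: it uses the hypothetical sequence $f(k) = 1/\log_2(k+2)$ (so that $f(k)\log_2 k \leq 1$), fixes $Q(0) = Q(1) = 1/4$, and picks the constant `A` conservatively from an integral bound instead of computing the exact normalizing constant, so that `q` has total mass at most 1.

```python
import math

def f(k):
    # Hypothetical sequence (an assumption) with f(k) * log2(k) <= 1.
    return 1.0 / math.log2(k + 2)

Q0 = 0.25  # mass placed on symbol 0 (symbol 1 receives 0.25 as well)
# sum_{k>=2} 1 / (k log2(k)^2) <= 1/2 + ln(2) by comparison with an integral,
# so this choice of A keeps the mass of {2, 3, ...} below 0.5.
A = 0.5 / (0.5 + math.log(2))

def q(k):
    """Coding mass on symbol k >= 2, proportional to 1 / (k log2(k)^2)."""
    return A / (k * math.log2(k) ** 2)

def divergence(k):
    """D(P_k, Q) in bits for the two-point distribution P_k, k >= 3."""
    fk = f(k)
    return (1 - fk) * math.log2((1 - fk) / Q0) + fk * math.log2(fk / q(k))

values = [divergence(k) for k in (3, 10, 10**3, 10**6, 10**9, 10**12)]
```

The entries of `values` stay bounded as `k` grows, which is what makes the maximal redundancy of `Q` over the class finite.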



Remark 4: Note that the coding probability used in the proof of the converse part of the proposition corresponds to one of the simplest prefix codes for integers proposed by Elias (1975).



The following theorem shows that, as far as envelope classes are concerned (see Definition 1), minimax redundancy and minimax regret are either both finite or both infinite. This is indeed much less precise than the relation stated in Theorem 2 about classes of sources on finite alphabets.

Theorem 3: Let $f$ be a non-negative function from $\mathbb{N}_+$ to $[0, 1]$, let $\Lambda_f$ be the class of stationary memoryless sources defined by envelope $f$. Then

$$R^+(\Lambda_f^n) < \infty \Leftrightarrow R^*(\Lambda_f^n) < \infty.$$



Remark 5: We will refine this result after the statement of Corollary 2.
Recall from Proposition 5 that $R^*(\Lambda_f^n) < \infty \Leftrightarrow \sum_{k \in \mathbb{N}_+} f(k) < \infty$.

Proof:
In order to check that

$$\sum_{k \in \mathbb{N}_+} f(k) = \infty \Rightarrow R^+(\Lambda_f^1) = \infty,$$

it is enough to check that if $\sum_{k \in \mathbb{N}_+} f(k) = \infty$, the envelope class contains an infinite collection of mutually singular sources.

Let the infinite sequence of integers $(h_i)_{i \in \mathbb{N}}$ be defined recursively by $h_0 = 0$ and

$$h_{i+1} = \min \left\{ h : \sum_{k = h_i + 1}^{h} f(k) > 1 \right\}.$$

The memoryless source $P_i$ is defined by its first marginal $P_i^1$ which is given by

$$P_i^1(m) = \frac{f(m)}{\sum_{k = h_i + 1}^{h_{i+1}} f(k)} \quad \text{for } m \in \{h_i + 1, \ldots, h_{i+1}\}.$$

Taking any prior with infinite Shannon entropy over the $\{P_i^1 ; i \in \mathbb{N}_+\}$ shows that

$$R^+(\{P_i^1 ; i \in \mathbb{N}_+\}) = \infty.$$
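The block construction in this proof is easy to carry out explicitly. The sketch below assumes the hypothetical non-summable envelope $f(k) = 1/k$ and computes the first few cut points $h_i$ together with the normalized marginals, which have disjoint supports and stay below the envelope.

```python
def envelope(k):
    # Hypothetical non-summable envelope (an assumption): f(k) = 1/k.
    return 1.0 / k

def singular_sources(num_sources):
    """Yield (h_i, h_{i+1}, marginal), where marginal maps m to P_i^1(m)."""
    h = 0
    for _ in range(num_sources):
        total, upper = 0.0, h
        # h_{i+1} is the first index at which the block sum exceeds 1.
        while total <= 1.0:
            upper += 1
            total += envelope(upper)
        marginal = {m: envelope(m) / total for m in range(h + 1, upper + 1)}
        yield h, upper, marginal
        h = upper

blocks = list(singular_sources(4))
cut_points = [(lo, hi) for lo, hi, _ in blocks]
```

Because each block sum exceeds 1, dividing by it keeps every marginal below the envelope, and sources built on different blocks are mutually singular.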



III. Envelope classes 

The next two theorems establish quantitative relations between minimax redundancy and regret and the shape of the envelope function. Even though the two theorems deal with general envelope functions, the reader might appreciate having two concrete examples of envelopes in mind: exponentially decreasing envelopes of the form $Ce^{-\alpha k}$ for appropriate $\alpha$ and $C$, and power-laws of the form $Ck^{-\alpha}$ again for appropriate $\alpha$ and $C$. The former family of envelope classes extends the class of sources over finite (but unknown) alphabets. The first theorem holds for any class of memoryless sources.

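Theorem 4: If $\Lambda$ is a class of memoryless sources, let the tail function $\bar{F}_{\Lambda^1}$ be defined by $\bar{F}_{\Lambda^1}(u) = \sum_{k > u} \hat{p}(k)$, then:

$$R^*(\Lambda^n) \leq \inf_{u : u \leq n} \left[ n \bar{F}_{\Lambda^1}(u) \log e + \frac{u - 1}{2} \log n \right] + 2.$$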

Choosing a sequence $(u_n)_n$ of positive integers in such a way that $u_n \to \infty$ while $(u_n \log n)/n \to 0$, this theorem allows us to complete Proposition 5.

Corollary 2: Let $\Lambda$ denote a class of memoryless sources, then the following holds:

$$R^*(\Lambda^n) < \infty \Leftrightarrow R^*(\Lambda^n) = o(n) \text{ and } R^+(\Lambda^n) = o(n).$$
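The trade-off behind Theorem 4 and Corollary 2 can be explored numerically. The sketch below is an illustration under assumptions: it takes a hypothetical power-law envelope $1 \wedge c k^{-\alpha}$ with $c = 5$ and $\alpha = 2$, uses the fact that the maximum likelihood of a single symbol is at most the envelope, and replaces the tail sum by the integral bound $c u^{1-\alpha}/(\alpha - 1)$, so the value returned is an upper bound on the right-hand side of the theorem.

```python
import math

def tail_bound(u, c=5.0, alpha=2.0):
    """Upper bound on sum_{k>u} min(1, c k^-alpha), via the integral of c x^-alpha over [u, inf)."""
    return c * u ** (1.0 - alpha) / (alpha - 1.0)

def regret_upper_bound(n, c=5.0, alpha=2.0):
    """min over 1 <= u <= n of n * F(u) * log2(e) + (u - 1)/2 * log2(n), plus 2 (in bits)."""
    best = min(
        n * tail_bound(u, c, alpha) * math.log2(math.e) + (u - 1) / 2 * math.log2(n)
        for u in range(1, n + 1)
    )
    return best + 2

# Per-symbol bound for two word lengths; it shrinks as n grows.
ratios = {n: regret_upper_bound(n) / n for n in (100, 10000)}
```

The decreasing per-symbol values in `ratios` illustrate the $o(n)$ behavior asserted by the corollary for summable envelopes.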



Remark 6: We may now have a second look at Proposition 6 and Theorem 3. In the setting of Proposition 6, this Corollary asserts that if $\sum_k f(k) < \infty$, for the source class defined by $f$, a non-trivial strong redundancy rate exists.

On the other hand, this corollary complements Theorem 3 by asserting that envelope classes have either non-trivial strong redundancy rates or infinite minimax redundancies.

Remark 7: Again, this statement has to be connected with related propositions from Haussler and Opper (1997). The last paper establishes bounds on minimax redundancy using geometric properties of the source class under Hellinger metric. For example, Theorem 4 in (Haussler and Opper, 1997) relates minimax redundancy and the metric dimension of the set $\Lambda^n$ with respect to the Hellinger metric (which coincides with the $L_2$ metric between the square roots of densities) under the implicit assumption that sources lying in small Hellinger balls have finite relative entropy (so that upper bounds in Lemma 7 there are finite). Envelope classes may not satisfy this assumption. Hence, there is no easy way to connect Theorem 4 and results from (Haussler and Opper, 1997).

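Proof: (Theorem 4) Any integer $u$ defines a decomposition of a string $x \in \mathbb{N}_+^n$ into two non-contiguous substrings: a substring $z$ made of the $m$ symbols from $x$ that are larger than $u$, and one substring $y$ made of the $n - m$ symbols that are smaller than or equal to $u$.

$$\begin{aligned}
\sum_{x \in \mathbb{N}_+^n} \hat{p}(x)
&\overset{(a)}{=} \sum_{m=0}^{n} \binom{n}{m} \sum_{z \in \{u+1, \ldots\}^m} \sum_{y \in \{1, 2, \ldots, u\}^{n-m}} \hat{p}(zy) \\
&\overset{(b)}{\leq} \sum_{m=0}^{n} \binom{n}{m} \sum_{z \in \{u+1, \ldots\}^m} \prod_{i=1}^{m} \hat{p}(z_i) \sum_{y \in \{1, 2, \ldots, u\}^{n-m}} \hat{p}(y) \\
&\overset{(c)}{\leq} \sum_{m=0}^{n} \binom{n}{m} \bar{F}_{\Lambda^1}(u)^m \sum_{y \in \{1, 2, \ldots, u\}^{n}} \hat{p}(y) \\
&\overset{(d)}{\leq} \left(1 + \bar{F}_{\Lambda^1}(u)\right)^n 2^{\frac{u-1}{2} \log n + 2}.
\end{aligned}$$

Equation (a) is obtained by reordering the symbols in the strings. Inequalities (b) and (c) follow respectively from Proposition 4 and Proposition 3. Inequality (d) is a direct consequence of the last inequality in Theorem 2.

Hence,

$$\begin{aligned}
R^*(\Lambda^n) &\leq n \log\left(1 + \bar{F}_{\Lambda^1}(u)\right) + \frac{u - 1}{2} \log n + 2 \\
&\leq n \bar{F}_{\Lambda^1}(u) \log e + \frac{u - 1}{2} \log n + 2.
\end{aligned}$$

■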



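The next theorem complements the upper-bound on minimax regret for envelope classes (Theorem 4). It describes a general lower bound on minimax redundancy for envelope classes.

Theorem 5: Let $f$ denote a non-increasing, summable envelope function. For any integer $p$, let $c(p) = \sum_{k=1}^{p} f(2k)$. Let $c(\infty) = \sum_{k \geq 1} f(2k)$. Assume furthermore that $c(\infty) > 1$. Let $p \in \mathbb{N}_+$ be such that $c(p) > 1$. Let $n \in \mathbb{N}_+$, $\epsilon > 0$ and $\lambda \in ]0, 1[$ be such that $n > \frac{c(p)}{f(2p)} \frac{10}{\epsilon(1 - \lambda)}$. Then

$$R^+(\Lambda_f^n) \geq C(p, n, \lambda, \epsilon) \sum_{i=1}^{p} \left( \frac{1}{2} \log \frac{n(1 - \lambda)\pi f(2i)}{2 c(p) e} - \epsilon \right),$$

where

$$C(p, n, \lambda, \epsilon) = \frac{1}{1 + \frac{c(p)}{\lambda^2 n f(2p)}} \left( 1 - \frac{4}{\pi} \sqrt{\frac{5 c(p)}{(1 - \lambda) \epsilon n f(2p)}} \right).$$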
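The lower bound of Theorem 5 is explicit enough to be evaluated directly. The sketch below is a plain transcription of the statement under assumptions that are not in the text: a hypothetical power-law envelope $1 \wedge 5k^{-2}$, the illustrative parameters $p = 2$, $\lambda = \epsilon = 1/2$, and logarithms in base 2. The function returns `None` when the hypotheses of the theorem are not met.

```python
import math

def envelope(k, c=5.0, alpha=2.0):
    # Hypothetical power-law envelope (an assumption): min(1, c * k^-alpha).
    return min(1.0, c * k ** (-alpha))

def redundancy_lower_bound(n, p, lam, eps, f=envelope):
    """Right-hand side of the lower bound, in bits, or None if a hypothesis fails."""
    c_p = sum(f(2 * k) for k in range(1, p + 1))
    if c_p <= 1.0 or not (0.0 < lam < 1.0) or eps <= 0.0:
        return None
    if n <= c_p / f(2 * p) * 10.0 / (eps * (1.0 - lam)):
        return None
    # First factor of C(p, n, lambda, eps): comes from the Chebyshev-Cantelli step.
    concentration = 1.0 / (1.0 + c_p / (lam ** 2 * n * f(2 * p)))
    # Second factor: prior mass kept away from the endpoints of [0, 1].
    interior = 1.0 - (4.0 / math.pi) * math.sqrt(
        5.0 * c_p / ((1.0 - lam) * eps * n * f(2 * p))
    )
    leading = sum(
        0.5 * math.log2(n * (1.0 - lam) * math.pi * f(2 * i) / (2.0 * c_p * math.e)) - eps
        for i in range(1, p + 1)
    )
    return concentration * interior * leading

bound = redundancy_lower_bound(n=10**6, p=2, lam=0.5, eps=0.5)
```

In practice $p$, $\lambda$ and $\epsilon$ would be tuned as functions of $n$; the fixed values above only show how the pieces of the statement fit together.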
Before proceeding to the proof, let us mention the following non-asymptotic bound from Xie and Barron (1997). Let $m_n^*$ denote the Krichevsky-Trofimov distribution over $\{0, 1\}^n$. That is, for any $x \in \{0, 1\}^n$, such that $n_1 = \sum_{i=1}^{n} x_i$ and $n_0 = n - n_1$,

$$m_n^*(x) = \frac{1}{\pi} \int_{[0,1]} \theta^{n_1 - 1/2} (1 - \theta)^{n_0 - 1/2} \, d\theta.$$

Lemma 1: (Xie and Barron, 1997, Lemma 1) For any $\epsilon > 0$, there exists a $c(\epsilon)$ such that for $n > 2c(\epsilon)$ the following holds uniformly over $\theta \in [c(\epsilon)/n, 1 - c(\epsilon)/n]$:

$$\left| D(p_\theta^n, m_n^*) - \frac{1}{2} \log \frac{n}{2\pi e} - \log \pi \right| \leq \epsilon.$$

The bound $c(\epsilon)$ can be chosen as small as $5/\epsilon$.
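The behavior described by Lemma 1 can be observed numerically. The sketch below is an illustration, not part of the original argument: it evaluates the Krichevsky-Trofimov probability through the Beta function (using $B(1/2, 1/2) = \pi$), computes the redundancy $D(p_\theta^n, m_n^*)$ exactly by summing over the number of ones, and compares it with $\frac{1}{2}\log\frac{n}{2\pi e} + \log\pi$. The values `n = 2000` and `theta = 0.3` are arbitrary choices.

```python
import math

LN2 = math.log(2.0)

def log2_kt(n1, n0):
    """log2 of the Krichevsky-Trofimov probability of a binary string with n1 ones and n0 zeros."""
    # Integral of theta^(n1 - 1/2) (1 - theta)^(n0 - 1/2) equals Beta(n1 + 1/2, n0 + 1/2).
    log_beta = math.lgamma(n1 + 0.5) + math.lgamma(n0 + 0.5) - math.lgamma(n1 + n0 + 1.0)
    return (log_beta - math.log(math.pi)) / LN2

def kt_redundancy(n, theta):
    """D(p_theta^n, m*_n) in bits, grouping strings by their number of ones."""
    total = 0.0
    for k in range(n + 1):
        log_string = k * math.log(theta) + (n - k) * math.log(1.0 - theta)
        log_count = math.lgamma(n + 1.0) - math.lgamma(k + 1.0) - math.lgamma(n - k + 1.0)
        weight = math.exp(log_count + log_string)  # probability of observing k ones
        total += weight * (log_string / LN2 - log2_kt(k, n - k))
    return total

def asymptotic_value(n):
    """(1/2) log2(n / (2 pi e)) + log2(pi)."""
    return 0.5 * math.log2(n / (2.0 * math.pi * math.e)) + math.log2(math.pi)

gap = abs(kt_redundancy(2000, 0.3) - asymptotic_value(2000))
```

For a parameter well inside $(0, 1)$, `gap` is small, in agreement with the uniform approximation stated in the lemma.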

Proof: The proof is organized in the following way. A prior probability distribution is first designed in such a way that it is supported by probability distributions that satisfy the envelope condition, have support equal to $\{1, \ldots, 2p\}$, and most importantly enjoys the following property. Letting $N$ be the random vector from $\mathbb{N}^p$ defined by $N_i(x) = |\{j : x_j \in \{2i - 1, 2i\}\}|$, where $x \in \mathcal{X}^n$, and letting $Q^*$ be the mixture distribution over $\mathcal{X}^n$ defined by the prior, then for any $P$ in the support of the prior, $P^n/Q^* = P^n\{\cdot \mid N\}/Q^*\{\cdot \mid N\}$. This property will provide a handy way to bound the capacity of the channel.

Let $f$, $n$, $p$, $\epsilon$ and $\lambda$ be as in the statement of the theorem. Let us first define a prior probability on $\Lambda_f^1$. For each integer $i$ between 1 and $p$, let $\mu_i$ be defined as

$$\mu_i = \frac{f(2i)}{c(p)}.$$

This ensures that the sequence $(\mu_i)_{i \leq p}$ defines a probability mass function over $\{1, \ldots, p\}$. Let $\theta = (\theta_i)_{1 \leq i \leq p}$ be a collection of independent random variables each distributed according to a Beta distribution with parameters $(1/2, 1/2)$. The prior probability $W = \otimes_{i=1}^{p} W_i$ for $\theta$ on $[0, 1]^p$ has thus density $w$ given by

$$w(\theta) = \frac{1}{\pi^p} \prod_{i=1}^{p} \left( \theta_i^{-1/2} (1 - \theta_i)^{-1/2} \right).$$

The memoryless source parameterized by $\theta$ is defined by the probability mass function $p_\theta(2i - 1) = \theta_i \mu_i$ and $p_\theta(2i) = (1 - \theta_i)\mu_i$ for $i : 1 \leq i \leq p$ and $p_\theta(j) = 0$ for $j > 2p$. Thanks to the condition $c(p) > 1$, this probability mass function satisfies the envelope condition.

For $i \leq p$, let the random variable $N_i$ (resp. $N_i^0$) be defined as the number of occurrences of $\{2i - 1, 2i\}$ (resp. $2i - 1$) in the sequence $x$. Let $N$ (resp. $N^0$) denote the random vector $N_1, \ldots, N_p$ (resp. $N_1^0, \ldots, N_p^0$). If a sequence $x$ from $\{1, \ldots, 2p\}^n$ contains $n_i = N_i(x)$ symbols from $\{2i - 1, 2i\}$ for each $i \in \{1, \ldots, p\}$, and if for each such $i$, the sequence contains $n_i^0 = N_i^0(x)$ (resp. $n_i^1$) symbols equal to $2i - 1$ (resp. $2i$) then

$$P_\theta^n(x) = \prod_{i=1}^{p} \left( \mu_i^{n_i} \theta_i^{n_i^0} (1 - \theta_i)^{n_i^1} \right).$$

Note that when the source $\theta$ is picked according to the prior $W$ and the sequence $X_{1:n}$ picked according to $P_\theta^n$, the random vector $N$ is multinomially distributed with parameters $n$ and $(\mu_1, \mu_2, \ldots, \mu_p)$, so the distribution of $N$ does not depend on the outcome of $\theta$. Moreover, conditionally on $N$, the conditional probability $P_\theta^n\{\cdot \mid N\}$ is a product distribution:

$$P_\theta^n(x \mid N) = \prod_{i=1}^{p} \left( \theta_i^{n_i^0} (1 - \theta_i)^{n_i^1} \right).$$

In statistical parlance, the random vectors $N$ and $N^0$ form a sufficient statistic for $\theta$.



Let $Q^*$ denote the mixture distribution on $\mathbb{N}_+^n$ induced by $W$:

$$Q^*(x) = \mathbb{E}_W[P_\theta^n(x)],$$

and, for each $n$, let $m_n^*$ denote the Krichevsky-Trofimov mixture over $\{0, 1\}^n$, then

$$Q^*(x) = \prod_{i=1}^{p} \left( \mu_i^{n_i} m_{n_i}^*(0^{n_i^0} 1^{n_i^1}) \right).$$

For a given value of $N$, the conditional probability $Q^*\{\cdot \mid N\}$ is also a product distribution:

$$Q^*(x \mid N) = \prod_{i=1}^{p} m_{n_i}^*(0^{n_i^0} 1^{n_i^1}),$$

so, we will be able to rely on:

$$\frac{P_\theta^n(x)}{Q^*(x)} = \frac{P_\theta^n(x \mid N)}{Q^*(x \mid N)}.$$



Now, the average redundancy of Q* w i m respect to Pg can be rewritten in a handy way. 



[D(P?,Q*] 



E 



E; 



log 



P n (X 1: „ | N) 



Q*(X 1:n | N) 



from the last equation, 



E 



E; 



log 



Pe(Xl:n | N) 
0*(^1:„ | N) 



In 



[Epn | N),Q*(- | N))]] 

:E W [E N [D(Pg(. |N),Q*(- | N))]] 
as the distribution of N does not depend on 0, 
E N [Em/ [D (P e "(- | N), Q*(- | N))]] 
by Fubini's Theorem. 
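We may develop $D\left(P_\theta^n(\cdot \mid N), Q^*(\cdot \mid N)\right)$ for a given value of $N = (n_1, n_2, \ldots, n_p)$. As both $P_\theta^n(\cdot \mid N)$ and $Q^*(\cdot \mid N)$ are product distributions on $\prod_{i=1}^p \{2i-1, 2i\}^{n_i}$, we have

$$E_W\left[D\left(P_\theta^n(\cdot \mid N), Q^*(\cdot \mid N)\right)\right] = E_W\left[\sum_{i=1}^p D\left(P_{\theta_i}^{n_i}, m^*_{n_i}\right)\right] = \sum_{i=1}^p E_{W_i}\left[D\left(P_{\theta_i}^{n_i}, m^*_{n_i}\right)\right].$$

The minimal average redundancy of $\Lambda_f^n$ with respect to the mixing distribution $W$ is thus finally given by:

$$\begin{aligned}
E_W\left[D(P_\theta^n, Q^*)\right]
&= E_N\left[\sum_{i=1}^p E_{W_i}\left[D\left(P_{\theta_i}^{n_i}, m^*_{n_i}\right)\right]\right]\\
&= \sum_{i=1}^p E_N\left[E_{W_i}\left[D\left(P_{\theta_i}^{n_i}, m^*_{n_i}\right)\right]\right]\\
&= \sum_{i=1}^p \sum_{n_i=0}^n \binom{n}{n_i} \mu_i^{n_i} (1-\mu_i)^{n-n_i}\, E_{W_i}\left[D\left(P_{\theta_i}^{n_i}, m^*_{n_i}\right)\right]. \qquad (1)
\end{aligned}$$

Hence, the minimal redundancy of $\Lambda_f^n$ with respect to prior probability $W$ is a weighted average of redundancies of Krichevsky-Trofimov mixtures over binary strings with different lengths.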
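As a concrete illustration of the building block $m^*_{n_i}$ appearing in the weighted average above, the following minimal sketch evaluates the Krichevsky-Trofimov mixture of a binary string in two ways: sequentially, with the add-1/2 predictive rule, and in closed form through Gamma functions (the Beta(1/2, 1/2) mixture of Bernoulli likelihoods). The function names `kt_sequential` and `kt_closed_form` are ours and do not come from the paper.

```python
import math

def kt_sequential(bits):
    """KT probability of a binary string, built with the add-1/2 predictive rule."""
    counts = [0, 0]
    prob = 1.0
    for t, b in enumerate(bits):
        # Predictive probability of symbol b after t observed symbols.
        prob *= (counts[b] + 0.5) / (t + 1.0)
        counts[b] += 1
    return prob

def kt_closed_form(n0, n1):
    """KT probability of any string with n0 zeros and n1 ones.

    Equals Gamma(n0 + 1/2) Gamma(n1 + 1/2) / (pi * Gamma(n0 + n1 + 1)).
    """
    log_p = (math.lgamma(n0 + 0.5) + math.lgamma(n1 + 0.5)
             - math.lgamma(n0 + n1 + 1.0) - math.log(math.pi))
    return math.exp(log_p)

if __name__ == "__main__":
    print(kt_sequential([0, 0, 1, 0]), kt_closed_form(3, 1))
```

The probability depends on the string only through the counts $(n_i^0, n_i^1)$, which is why the mixture $Q^*$ factorizes over the pairs $\{2i-1, 2i\}$.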



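At some point, we will use the Chebychef-Cantelli inequality (see Devroye et al., 1996) which asserts that for a square-integrable random variable $X$:

$$\Pr\{X \le E[X] - t\} \le \frac{\mathrm{Var}(X)}{\mathrm{Var}(X) + t^2} .$$

Besides, note that for all $\epsilon < \frac12$,

$$\int_\epsilon^{1-\epsilon} \frac{dx}{\pi\sqrt{x(1-x)}} \ge 1 - \frac{4}{\pi}\sqrt{\epsilon} . \qquad (2)$$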
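Now, the proposition is derived by processing the right-hand-side of Equation (1).

Under condition $(1-\lambda)\, n\, \frac{f(2p)}{c(p)} > \frac{10}{\epsilon}$, we have $\frac{5}{n_i \epsilon} < \frac12$ for all $i \le p$ such that $n_i \ge (1-\lambda) n \mu_i$. Hence,

$$\begin{aligned}
E_W\left[D(P_\theta^n, Q^*)\right]
&\ge \sum_{i=1}^p \sum_{n_i \ge (1-\lambda) n \mu_i} \binom{n}{n_i} \mu_i^{n_i} (1-\mu_i)^{n-n_i} \int_{\frac{5}{n_i\epsilon}}^{1-\frac{5}{n_i\epsilon}} D\left(P_{\theta_i}^{n_i}, m^*_{n_i}\right) \frac{d\theta_i}{\pi\sqrt{\theta_i(1-\theta_i)}}\\
&\ge \sum_{i=1}^p \sum_{n_i \ge (1-\lambda) n \mu_i} \binom{n}{n_i} \mu_i^{n_i} (1-\mu_i)^{n-n_i} \int_{\frac{5}{n_i\epsilon}}^{1-\frac{5}{n_i\epsilon}} \left(\frac12 \log\frac{n_i}{2\pi e} + \log \pi - \epsilon\right) \frac{d\theta_i}{\pi\sqrt{\theta_i(1-\theta_i)}}\\
&\qquad \text{by the proposition from Xie and Barron (1997),}\\
&\ge \sum_{i=1}^p \sum_{n_i \ge (1-\lambda) n \mu_i} \binom{n}{n_i} \mu_i^{n_i} (1-\mu_i)^{n-n_i} \left(1 - \frac{4}{\pi}\sqrt{\frac{5}{n_i \epsilon}}\right) \left(\frac12 \log\frac{n_i}{2\pi e} + \log \pi - \epsilon\right)\\
&\qquad \text{from (2),}\\
&\ge \sum_{i=1}^p \sum_{n_i \ge (1-\lambda) n \mu_i} \binom{n}{n_i} \mu_i^{n_i} (1-\mu_i)^{n-n_i} \left(1 - \frac{4}{\pi}\sqrt{\frac{5}{(1-\lambda) n \mu_i \epsilon}}\right) \left(\frac12 \log\frac{n (1-\lambda) \mu_i}{2\pi e} + \log \pi - \epsilon\right)\\
&\qquad \text{using monotonicity of } x \log x,\\
&\ge \sum_{i=1}^p \frac{1}{1 + \frac{1-\mu_i}{\lambda^2 n \mu_i}} \left(1 - \frac{4}{\pi}\sqrt{\frac{5}{(1-\lambda) n \mu_i \epsilon}}\right) \left(\frac12 \log\frac{n (1-\lambda) \mu_i}{2\pi e} + \log \pi - \epsilon\right)\\
&\qquad \text{invoking the Chebychef-Cantelli inequality,}\\
&\ge \frac{1}{1 + \frac{c(p)}{\lambda^2 n f(2p)}} \left(1 - \frac{4}{\pi}\sqrt{\frac{5 c(p)}{(1-\lambda)\, \epsilon\, n f(2p)}}\right) \sum_{i=1}^p \left(\frac12 \log\frac{n (1-\lambda) f(2i)}{2 c(p) \pi e} + \log \pi - \epsilon\right)\\
&\qquad \text{using the monotonicity assumption on } f.
\end{aligned}$$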



IV. Examples of envelope classes 

Theorems 3, 4 and 5 assert that the summability of the envelope defining a class of memoryless sources characterizes the (strong) universal compressibility of that class. However, it is not easy to figure out whether the bounds provided by the last two theorems are close to each other or not. In this Section, we investigate the case of envelopes which decline either like power laws or exponentially fast. In both cases, upper-bounds on minimax regret will follow directly from Theorem 4 and a straightforward optimization. Specific lower bounds on minimax redundancies are derived by mimicking the proof of Theorem 5, either faithfully as in the case of exponential envelopes or by developing an alternative prior as in the case of power-law envelopes.

A. Power-law envelope classes 

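Let us first agree on the classical notation: $\zeta(\alpha) = \sum_{k \ge 1} \frac{1}{k^\alpha}$, for $\alpha > 1$.

Theorem 6: Let $\alpha$ denote a real number larger than 1, and $C$ be such that $C\zeta(\alpha) \ge 2^\alpha$. The source class $\Lambda_{C\cdot^{-\alpha}}$ is the envelope class associated with the decreasing function $f_{\alpha,C} : x \mapsto 1 \wedge \frac{C}{x^\alpha}$ for $C > 1$ and $\alpha > 1$.
Then:

1) $$n^{1/\alpha} A(\alpha) \log \left\lfloor (C\zeta(\alpha))^{1/\alpha} \right\rfloor \le R^+\left(\Lambda^n_{C\cdot^{-\alpha}}\right),$$

where

$$A(\alpha) = \frac{1}{\alpha} \int_1^\infty \frac{1}{u^{1-1/\alpha}} \left(1 - e^{-1/(\zeta(\alpha) u)}\right) du .$$

2) $$R^*\left(\Lambda^n_{C\cdot^{-\alpha}}\right) \le \left(\frac{2Cn}{\alpha-1}\right)^{1/\alpha} (\log n)^{1-1/\alpha} + O(1) .$$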

Remark 8: The gap between the lower-bound and the upper-bound is of order $(\log n)^{1-1/\alpha}$. We are not in a position to claim that one of the two bounds is tight, let alone which one is tight. Note however that as $\alpha \to \infty$ and $C = H^\alpha$, class $\Lambda_{C\cdot^{-\alpha}}$ converges to the class of memoryless sources on alphabet $\{1, \ldots, H\}$ for which the minimax regret is $\frac{H-1}{2} \log n$. This is (up to a factor 2) what we obtain by taking the limits in our upper-bound of $R^*(\Lambda^n_{C\cdot^{-\alpha}})$. On the other side, the limit of our lower-bound when $\alpha$ goes to 1 is infinite, which is also satisfying since it agrees with Theorem 3.

Remark 9: In contrast with various lower bounds derived using a similar methodology, the proof given here relies on a single prior probability distribution on the parameter space and works for all values of $n$.

Note that the lower bound that can be derived from Theorem 5 is of the same order of magnitude $O(n^{1/\alpha})$ as the lower bound stated here (see Appendix II). The proof given here is completely elementary and does not rely on the subtle computations described in Xie and Barron (1997).

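Proof: For the upper-bound on minimax regret, note that

$$\bar{F}_{\alpha,C}(u) = \sum_{k > u} 1 \wedge \frac{C}{k^\alpha} \le \frac{C}{(\alpha-1)\, u^{\alpha-1}} .$$

Hence, choosing $u_n = \left(\frac{2Cn}{(\alpha-1) \log n}\right)^{1/\alpha}$, and resorting to Theorem 4, we get:

$$R^*\left(\Lambda^n_{C\cdot^{-\alpha}}\right) \le \left(\frac{2Cn}{\alpha-1}\right)^{1/\alpha} (\log n)^{1-1/\alpha} + O(1) .$$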



Let us now turn to the lower bound. We first define a finite set $\Theta$ of parameters such that $P_\theta^n \in \Lambda^n_{C\cdot^{-\alpha}}$ for any $\theta \in \Theta$ and then we use the mutual information lower bound on redundancy.
Let $m$ be a positive integer such that $m^\alpha \le C\zeta(\alpha)$.

The set $\{P_\theta, \theta \in \Theta\}$ consists of memoryless sources over the alphabet $\mathbb{N}_+$. Each parameter $\theta$ is a sequence of integers $\theta = (\theta_1, \theta_2, \ldots)$. We take a prior distribution on $\Theta$ such that $(\theta_k)_k$ is a sequence of independent identically distributed random variables with uniform distribution on $\{1, \ldots, m\}$. For any such $\theta$, $P_\theta$ is a probability distribution on $\mathbb{N}_+$ with support $\bigcup_{k \ge 1} \{(k-1)m + \theta_k\}$, namely:

$$P_\theta\left((k-1)m + \theta_k\right) = \frac{1}{\zeta(\alpha) k^\alpha} = \frac{m^\alpha}{\zeta(\alpha)} \frac{1}{(km)^\alpha} \quad \text{for } k \ge 1. \qquad (3)$$

The condition $m^\alpha \le C\zeta(\alpha)$ ensures that $P_\theta \in \Lambda_{C\cdot^{-\alpha}}$.
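Now, the mutual information between parameter $\theta$ and source output $X_{1:n}$ is

$$I(\theta, X_{1:n}) = \sum_{k \ge 1} I(\theta_k, X_{1:n}) .$$

Let $N_k(\mathbf{x}) = 1$ if there exists some index $i \in \{1, \ldots, n\}$ such that $x_i \in [(k-1)m + 1, km]$, and 0 otherwise. Note that the distribution of $N_k$ does not depend on the value of $\theta$. Thus we can write:

$$I(\theta_k, X_{1:n}) = I(\theta_k, X_{1:n} \mid N_k = 0)\, P(N_k = 0) + I(\theta_k, X_{1:n} \mid N_k = 1)\, P(N_k = 1) .$$

But, conditionally on $N_k = 0$, $\theta_k$ and $X_{1:n}$ are independent. Moreover, conditionally on $N_k = 1$ we have

$$I(\theta_k, X_{1:n} \mid N_k = 1) = E\left[\log \frac{P(\theta_k = j \mid X_{1:n})}{P(\theta_k = j)} \,\Big|\, N_k = 1\right] = \log m .$$

Hence,

$$I(\theta, X_{1:n}) = \sum_{k \ge 1} P(N_k = 1) \log m = E_{P_\theta}[Z_n] \log m ,$$

where $Z_n(\mathbf{x})$ denotes the number of distinct symbols in string $\mathbf{x}$ (note that its distribution does not depend on the value of $\theta$). As $Z_n = \sum_{k \ge 1} \mathbb{1}_{N_k(\mathbf{x}) = 1}$, the expectation translates into a sum which leads to:

$$I(\theta, X_{1:n}) = \left(\sum_{k=1}^\infty \left(1 - \left(1 - \frac{1}{\zeta(\alpha) k^\alpha}\right)^n\right)\right) \times \log m .$$

Now:

$$\begin{aligned}
\sum_{k=1}^\infty \left(1 - \left(1 - \frac{1}{\zeta(\alpha) k^\alpha}\right)^n\right)
&\ge \sum_{k=1}^\infty \left(1 - \exp\left(-\frac{n}{\zeta(\alpha) k^\alpha}\right)\right) && \text{as } 1 - x \le \exp(-x)\\
&\ge \int_1^\infty \left(1 - \exp\left(-\frac{n}{\zeta(\alpha) x^\alpha}\right)\right) dx\\
&\ge \frac{n^{1/\alpha}}{\alpha} \int_1^\infty \frac{1}{u^{1-1/\alpha}} \left(1 - \exp\left(-\frac{1}{\zeta(\alpha) u}\right)\right) du .
\end{aligned}$$

In order to optimize the bound we choose the largest possible $m$ which is $m = \left\lfloor (C\zeta(\alpha))^{1/\alpha} \right\rfloor$.
For an alternative derivation of a similar lower-bound using Theorem 5, see Appendix II.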
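The last inequality in the chain above rests on a change of variable that the text leaves implicit. A worked version, using only the notation of Theorem 6, reads as follows (a routine verification, not an additional result):

```latex
\begin{align*}
\int_1^\infty \Bigl(1-\exp\Bigl(-\frac{n}{\zeta(\alpha)x^{\alpha}}\Bigr)\Bigr)\,\mathrm{d}x
 &= \frac{n^{1/\alpha}}{\alpha}\int_{1/n}^\infty \frac{1}{u^{1-1/\alpha}}
    \Bigl(1-\exp\Bigl(-\frac{1}{\zeta(\alpha)u}\Bigr)\Bigr)\,\mathrm{d}u
 && \text{with } u = x^{\alpha}/n,\ \mathrm{d}x = \tfrac{n^{1/\alpha}}{\alpha}u^{1/\alpha-1}\,\mathrm{d}u \\
 &\ge \frac{n^{1/\alpha}}{\alpha}\int_{1}^\infty \frac{1}{u^{1-1/\alpha}}
    \Bigl(1-\exp\Bigl(-\frac{1}{\zeta(\alpha)u}\Bigr)\Bigr)\,\mathrm{d}u
 && \text{since the integrand is non-negative and } 1/n \le 1 \\
 &= n^{1/\alpha} A(\alpha).
\end{align*}
```

Multiplying by $\log m$ with $m = \lfloor (C\zeta(\alpha))^{1/\alpha} \rfloor$ then gives the first statement of Theorem 6.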
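B. Exponential envelope classes

Theorems 4 and 5 provide almost matching bounds on the minimax redundancy for source classes defined by exponentially vanishing envelopes.

Theorem 7: Let $C$ and $\alpha$ denote positive real numbers satisfying $C > e^{2\alpha}$. The class $\Lambda_{Ce^{-\alpha\cdot}}$ is the envelope class associated with function $f_\alpha : x \mapsto 1 \wedge Ce^{-\alpha x}$. Then

$$\frac{1}{8\alpha} \log^2 n\, (1 - o(1)) \le R^+\left(\Lambda^n_{Ce^{-\alpha\cdot}}\right) \le R^*\left(\Lambda^n_{Ce^{-\alpha\cdot}}\right) \le \frac{1}{2\alpha} \log^2 n + O(1) .$$

Proof: For the upper-bound, note that

$$\bar{F}_\alpha(u) = \sum_{k > u} 1 \wedge Ce^{-\alpha k} \le \frac{C}{1 - e^{-\alpha}}\, e^{-\alpha u} .$$

Hence, by choosing the optimal value $u_n = \frac{1}{\alpha} \log n$ in Theorem 4 we get:

$$R^*\left(\Lambda^n_{Ce^{-\alpha\cdot}}\right) \le \frac{1}{2\alpha} \log^2 n + O(1) .$$

We will now prove the lower bound using Theorem 5. The constraint $C > e^{2\alpha}$ warrants that the sequence $c(p) = \sum_{k=1}^p f(2k) \ge Ce^{-2\alpha}$ is larger than 1 for all $p$.

If we choose $p = \left\lfloor \frac{1}{2\alpha} (\log n - \log\log n) \right\rfloor$, then $n f(2p) \ge C n e^{-\log n + \log\log n - 2\alpha}$ goes to infinity with $n$. For $\epsilon = \lambda = \frac12$, we get $C(p, n, \lambda, \epsilon) = 1 - o(1)$. Besides,

$$\begin{aligned}
\sum_{i=1}^p \left(\frac12 \log \frac{n (1-\lambda) C \pi e^{-2\alpha i}}{2 c(p) e} - \epsilon\right)
&= \frac{p}{2} \log n + \frac{p}{2} \left(\log \frac{(1-\lambda) C \pi}{2 c(p) e} - 2\epsilon\right) - \alpha \sum_{i=1}^p i\\
&= \left(\frac{1}{4\alpha} \log^2 n - \frac{1}{8\alpha} \log^2 n\right) (1 + o(1))\\
&= \frac{1}{8\alpha} \log^2 n\, (1 + o(1)) .
\end{aligned}$$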

V. A CENSORING CODE FOR ENVELOPE CLASSES 

The proof of Theorem 4 suggests handling small and large (allegedly infrequent) symbols separately. Such an algorithm should perform quite well as soon as the tail behavior of the envelope provides an adequate description of the sources in the class. The coding algorithms suggested by the proof of Theorem 4, which are based on the Shtarkov NML coder, are not computationally attractive. The design of the next algorithm (CensoringCode) is again guided by the proof of Theorem 4: it is parameterized by a sequence of cutoffs $(K_i)_{i \in \mathbb{N}}$ and handles the $i$-th symbol of the sequence to be encoded differently according to whether it is smaller or larger than cutoff $K_i$; in the latter situation, the symbol is said to be censored. The CensoringCode algorithm uses Elias penultimate code (Elias, 1975) to encode censored symbols and Krichevsky-Trofimov mixtures (Krichevsky and Trofimov, 1981) to encode the sequence of non-censored symbols padded with markers (zeros) to witness acts of censorship. The performance of this algorithm is evaluated on the power-law envelope class $\Lambda_{C\cdot^{-\alpha}}$, already investigated in Section IV. In this section, the parameters $\alpha$ and $C$ are assumed to be known.

February 2, 2008 DRAFT

Let us first describe the algorithm more precisely. Given a non-decreasing sequence of cutoffs $(K_i)_{i \le n}$, a string $\mathbf{x}$ from $\mathbb{N}_+^n$ defines two strings $\tilde{\mathbf{x}}$ and $\check{\mathbf{x}}$ in the following way. The $i$-th symbol $x_i$ of $\mathbf{x}$ is censored if $x_i > K_i$. String $\tilde{\mathbf{x}}$ has length $n$ and belongs to $\prod_{i=1}^n \mathcal{X}_i$ where $\mathcal{X}_i = \{0, \ldots, K_i\}$:

$$\tilde{x}_i = \begin{cases} x_i & \text{if } x_i \le K_i\\ 0 & \text{otherwise (the symbol is censored).} \end{cases}$$

Symbol 0 serves as an escape symbol. Meanwhile, string $\check{\mathbf{x}}$ is the subsequence of censored symbols, that is $(x_i)_{x_i > K_i,\, i \le n}$.

The algorithm encodes $\mathbf{x}$ as a pair of binary strings C1 and C2. The first one (C1) is obtained by applying Elias penultimate code to each symbol from $\check{\mathbf{x}}$, that is to each censored symbol. The second string (C2) is built by applying arithmetic coding to $\tilde{\mathbf{x}}$ using side-information from $\check{\mathbf{x}}$. Decoding C2 can be sequentially carried out using information obtained from decoding C1.

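In order to describe the coding probability used to encode $\tilde{\mathbf{x}}$, we need a few more counters. For $j > 0$, let $n_i^j$ be the number of occurrences of symbol $j$ in $x_{1:i}$ and let $n_i^0$ be the number of symbols larger than $K_{i+1}$ in $x_{1:i}$ (note that this is not larger than the number of censored symbols in $x_{1:i}$, and that the counters $n_i^j$ can be recovered from $\tilde{x}_{1:i}$ and $\check{\mathbf{x}}$). The conditional coding probability over alphabet $\mathcal{X}_{i+1} = \{0, \ldots, K_{i+1}\}$ given $\tilde{x}_{1:i}$ and $\check{\mathbf{x}}$ is derived from the Krichevsky-Trofimov mixture over $\mathcal{X}_{i+1}$. It is the posterior distribution corresponding to Jeffreys' prior on the $(1 + K_{i+1})$-dimensional probability simplex and counts $n_i^j$ for $j$ running from 0 to $K_{i+1}$:

$$Q\left(\tilde{X}_{i+1} = j \mid \tilde{X}_{1:i} = \tilde{x}_{1:i}, \check{X} = \check{\mathbf{x}}\right) = \frac{n_i^j + \frac12}{i + \frac{K_{i+1} + 1}{2}} .$$

The length of C2($\mathbf{x}$) is (up to a quantity smaller than 1) given by

$$-\sum_{i=0}^{n-1} \log Q\left(\tilde{x}_{i+1} \mid \tilde{x}_{1:i}, \check{\mathbf{x}}\right) = -\log Q\left(\tilde{\mathbf{x}} \mid \check{\mathbf{x}}\right) .$$

The following description of the coding probability will prove useful when upper-bounding redundancy. For $1 \le j \le K_n$, let $s^j$ be the number of censored occurrences of symbol $j$. Let $n^j$ serve as a shorthand for $n_n^j$. Let $i(j)$ be $n$ if $K_n < j$ or the largest integer $i$ such that $K_i$ is smaller than $j$, then $s^j = n^j_{i(j)}$. The following holds

$$Q\left(\tilde{\mathbf{x}} \mid \check{\mathbf{x}}\right) = \left(\prod_{j=1}^{K_n} \frac{\Gamma(n^j + 1/2)}{\Gamma(s^j + 1/2)}\right) \left(\prod_{i :\, \tilde{x}_i = 0} \left(n^0_{i-1} + \frac12\right)\right) \left(\prod_{i=0}^{n-1} \frac{1}{i + \frac{K_{i+1}+1}{2}}\right) .$$

Note that the sequence $(n_i^0)_{i \le n}$ is not necessarily non-decreasing.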

A technical description of algorithm CensoringCode is given below. The procedure EliasCode takes as input an integer $j$ and outputs a binary encoding of $j$ using exactly $\ell(j)$ bits where $\ell$ is defined by: $\ell(j) = \lfloor \log j + 2 \log(1 + \log j) + 1 \rfloor$. The procedure ArithCode builds on the arithmetic coding methodology (Rissanen and Langdon, 1979). It is enough to remember that an arithmetic coder takes advantage of the fact that a coding probability $Q$ is completely defined by the sequence of conditional distributions of the $(i+1)$-th symbol given the past up to time $i$.

The proof of the upper-bound in Theorem 6 prompts us to choose $K_i = \lambda i^{1/\alpha}$; it will become clear afterward that a reasonable choice is $\lambda = \left(\frac{4C}{\alpha-1}\right)^{1/\alpha}$.
Algorithm 1 CensoringCode

    K ← 0
    counts ← [1/2, 1/2, . . .]
    for i from 1 to n do
        cutoff ← ⌊(4Ci/(α − 1))^{1/α}⌋
        if cutoff > K then
            for j ← K + 1 to cutoff do
                counts[0] ← counts[0] − counts[j] + 1/2
            end for
            K ← cutoff
        end if
        if x[i] ≤ cutoff then
            ArithCode(x[i], counts[0 : cutoff])
        else
            ArithCode(0, counts[0 : cutoff])
            C1 ← C1 · EliasCode(x[i])
            counts[0] ← counts[0] + 1
        end if
        counts[x[i]] ← counts[x[i]] + 1
    end for
    C2 ← ArithCode()
    C1 · C2
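The bookkeeping of Algorithm 1 can be checked on small inputs without implementing an arithmetic coder. The sketch below is our own simplification, not the authors' implementation: it returns the number of bits spent by the Elias code on censored symbols and the ideal arithmetic-coding length, taken here to be minus the binary logarithm of the coding probability. The names `elias_length` and `censoring_code_length`, and the use of a floor in the cutoff, are assumptions made for the illustration.

```python
import math

def elias_length(j):
    """Length l(j) = floor(log j + 2 log(1 + log j) + 1), binary logarithms."""
    lg = math.log2(j)
    return math.floor(lg + 2 * math.log2(1 + lg) + 1)

def censoring_code_length(x, C, alpha):
    """Return (bits of C1, ideal bits of C2) for a list x of positive integers."""
    K = 0            # current cutoff
    counts = {}      # counts[j] = occurrences of symbol j so far (censored or not)
    escape = 0       # number of past symbols larger than the current cutoff
    c1_bits = 0
    c2_bits = 0.0
    for i, sym in enumerate(x, start=1):
        cutoff = math.floor((4 * C * i / (alpha - 1)) ** (1 / alpha))
        if cutoff > K:
            # Symbols that fall below the new cutoff no longer count as escapes.
            for j in range(K + 1, cutoff + 1):
                escape -= counts.get(j, 0)
            K = cutoff
        # Krichevsky-Trofimov normalizer over the alphabet {0, ..., cutoff}.
        total = (i - 1) + (cutoff + 1) / 2
        if sym <= cutoff:
            p = (counts.get(sym, 0) + 0.5) / total
        else:
            p = (escape + 0.5) / total
            c1_bits += elias_length(sym)
            escape += 1
        c2_bits -= math.log2(p)
        counts[sym] = counts.get(sym, 0) + 1
    return c1_bits, c2_bits

if __name__ == "__main__":
    print(censoring_code_length([3, 3, 3], C=1.0, alpha=2.0))
```

With $C = 1$ and $\alpha = 2$ the cutoffs are 2, 2, 3, so on the input (3, 3, 3) the first two symbols are censored and the third one is not, which exercises the update of the escape counter when the cutoff grows.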



Theorem 8: Let $C$ and $\alpha$ be positive reals. Let the sequence of cutoffs $(K_i)_{i \le n}$ be given by

$$K_i = \left(\frac{4Ci}{\alpha-1}\right)^{1/\alpha} .$$

The expected redundancy of procedure CensoringCode on the envelope class $\Lambda_{C\cdot^{-\alpha}}$ is not larger than

$$\left(\frac{4Cn}{\alpha-1}\right)^{1/\alpha} \log n\, (1 + o(1)) .$$

Remark 10: The redundancy upper-bound in this Theorem is within a factor $\log n$ from the lower bound $O(n^{1/\alpha})$ from Theorem 6.

The proof of the Theorem builds on the next two lemmas. The first lemma compares the length of C2(x) with 
a tractable quantity. The second lemma upper-bounds the average length of Cl(x) by a quantity which is of the 
same order of magnitude as the upper-bound on redundancy we are looking for. 

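We need a few more definitions. Let $\mathbf{y}$ be the string of length $n$ over alphabet $\mathcal{X}_n$ defined by:

$$y_i = \begin{cases} x_i & \text{if } x_i \le K_n\\ 0 & \text{otherwise.} \end{cases}$$

For $0 \le j \le K_n$, note that the previously defined shorthand $n^j$ is the number of occurrences of symbol $j$ in $\mathbf{y}$.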






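The string $\mathbf{y}$ is obtained from $\mathbf{x}$ in the same way as $\tilde{\mathbf{x}}$ using the constant cutoff $K_n$.
Let $m^*_n$ be the Krichevsky-Trofimov mixture over alphabet $\{0, \ldots, K_n\}$:

$$m^*_n(\mathbf{y}) = \frac{\Gamma\left(\frac{K_n+1}{2}\right)}{\Gamma(1/2)^{K_n+1}}\, \frac{\prod_{j=0}^{K_n} \Gamma\left(n^j + \frac12\right)}{\Gamma\left(n + \frac{K_n+1}{2}\right)} .$$

String $\tilde{\mathbf{x}}$ seems easier to encode than $\mathbf{y}$ since it is possible to recover $\tilde{\mathbf{x}}$ from $\mathbf{y}$. This observation does not however automatically warrant that the length of C2($\mathbf{x}$) is not significantly larger than any reasonable codeword length for $\mathbf{y}$. Such a guarantee is provided by the following lemma.

Lemma 2: For every string $\mathbf{x} \in \mathbb{N}_+^n$, the length of C2($\mathbf{x}$) is not larger than $-\log m^*_n(\mathbf{y})$.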

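Proof: [Lemma 2] Let $s^0$ be the number of occurrences of 0 in $\mathbf{y}$, that is the number of symbols in $\mathbf{x}$ that are larger than $K_n$. Let

$$T_0 = \prod_{i=1,\, \tilde{x}_i = 0}^{n} \left(n^0_{i-1} + 1/2\right) .$$

Then, the following holds:

$$\begin{aligned}
T_0 &\overset{(a)}{=} \left[\prod_{i :\, y_i = 0} \left(n^0_{i-1} + 1/2\right)\right] \prod_{j=1}^{K_n} \prod_{i \le i(j),\, x_i = j} \left(n^0_{i-1} + 1/2\right)\\
&\overset{(b)}{\ge} \left(\frac{\Gamma(s^0 + 1/2)}{\Gamma(1/2)}\right) \prod_{j=1}^{K_n} \left(\frac{\Gamma(s^j + 1/2)}{\Gamma(1/2)}\right),
\end{aligned}$$

where (a) follows from the fact that symbol $x_i$ is censored either because $x_i > K_n$ (that is $y_i = 0$) or because $x_i = j \le K_n$ and $i \le i(j)$; (b) follows from the fact that for each $i \le n$ such that $y_i = 0$, $n^0_{i-1} \ge \sum_{i' < i} \mathbb{1}_{x_{i'} > K_n}$, while for each $j$, $0 < j \le K_n$, for each $i \le i(j)$, $n^0_{i-1} \ge n^j_{i-1}$.

From the last inequality, it follows that

$$\begin{aligned}
Q\left(\tilde{\mathbf{x}} \mid \check{\mathbf{x}}\right)
&\ge \left[\prod_{j=1}^{K_n} \frac{\Gamma(n^j + 1/2)}{\Gamma(s^j + 1/2)}\right] \frac{\Gamma(s^0 + 1/2)}{\Gamma(1/2)} \prod_{j=1}^{K_n} \frac{\Gamma(s^j + 1/2)}{\Gamma(1/2)} \left(\prod_{i=0}^{n-1} \frac{1}{i + \frac{K_{i+1}+1}{2}}\right)\\
&\ge \left[\prod_{j=1}^{K_n} \frac{\Gamma(n^j + 1/2)}{\Gamma(s^j + 1/2)}\right] \frac{\Gamma(s^0 + 1/2)}{\Gamma(1/2)} \prod_{j=1}^{K_n} \frac{\Gamma(s^j + 1/2)}{\Gamma(1/2)} \left(\prod_{i=0}^{n-1} \frac{1}{i + \frac{K_n+1}{2}}\right)\\
&= m^*_n(\mathbf{y}) ,
\end{aligned}$$

where the last inequality holds since $(K_i)_i$ is a non-decreasing sequence. ■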

The next lemma shows that the expected length of C1($X_{1:n}$) is not larger than the upper-bound we are looking for.

Lemma 3: For every source $P \in \Lambda_{C\cdot^{-\alpha}}$, the expected length of the encoding of the censored symbols (C1($X_{1:n}$)) satisfies:

$$E_P\left[|\mathrm{C1}(X_{1:n})|\right] \le \frac{2C}{(\alpha-1) \lambda^{\alpha-1}}\, n^{1/\alpha} \log n\, (1 + o(1)) .$$
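Proof: [Lemma 3] Let $1 \le a < b$ and $\beta > 0$ but $\beta \ne 1$. Recall that we use binary logarithms and note that:

$$\int_a^b \frac{dx}{x^\beta} = \left[\frac{1}{(1-\beta)\, x^{\beta-1}}\right]_a^b \qquad (4)$$

$$\int_a^b \frac{\log x}{x^\beta}\, dx = \left[\frac{\log x}{(1-\beta)\, x^{\beta-1}} - \frac{\log e}{(1-\beta)^2\, x^{\beta-1}}\right]_a^b \qquad (5)$$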

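The expected length of the Elias encoding of censored symbols (C1($X_{1:n}$)) is:

$$\begin{aligned}
E\left[|\mathrm{C1}(X_{1:n})|\right]
&= E\left[\sum_{j=1}^n \ell(X_j)\, \mathbb{1}_{X_j > K_j}\right]
= \sum_{j=1}^n \sum_{x = K_j+1}^\infty \ell(x)\, P^1(x)\\
&\le C \sum_{j=1}^n \sum_{x = K_j+1}^\infty \frac{\ell(x)}{x^\alpha}\\
&\le C \sum_{j=1}^n \sum_{x = K_j+1}^\infty \frac{\log(x) + 2\log(1 + \log x) + 1}{x^\alpha} .
\end{aligned}$$

Note that for $x \ge 2^7$,

$$\log x \ge 2\log(1 + \log x) + 1 ,$$

so that the last sum is upper-bounded by

$$C \sum_{j=1}^n \sum_{x = K_j+1}^\infty \frac{2 \log x}{x^\alpha} + C \sum_{j :\, K_j < 2^7} \sum_{x=1}^{2^7} \frac{2\log(1 + \log x) + 1 - \log x}{x^\alpha} .$$

Using the expressions for the integrals above, we get:

$$\sum_{x = K_j+1}^\infty \frac{\log x}{x^\alpha} \le \int_{K_j}^\infty \frac{\log x}{x^\alpha}\, dx \le \frac{\log K_j + \frac{\log e}{\alpha-1}}{(\alpha-1)\, K_j^{\alpha-1}} .$$

Thus, as $K_j = \lambda j^{1/\alpha}$, let us denote by $D_\lambda$ the expression

$$D_\lambda = C \sum_{j :\, j < (2^7/\lambda)^\alpha} \sum_{x=1}^{2^7} \frac{2\log(1 + \log x) + 1 - \log x}{x^\alpha} .$$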
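Now, we substitute $\beta$ by $1 - \frac{1}{\alpha}$ in Equations (4) and (5) to obtain:

$$\begin{aligned}
E\left[|\mathrm{C1}(X_{1:n})|\right]
&\le D_\lambda + \frac{2C}{(\alpha-1) \lambda^{\alpha-1}} \sum_{j=1}^n \frac{\log\left(\lambda j^{1/\alpha}\right) + \frac{\log e}{\alpha-1}}{j^{1-1/\alpha}}\\
&\le D_\lambda + \frac{2C}{\alpha (\alpha-1) \lambda^{\alpha-1}} \sum_{j=1}^n \frac{\log j}{j^{1-1/\alpha}} + O\left(n^{1/\alpha}\right)\\
&\le \frac{2C}{(\alpha-1) \lambda^{\alpha-1}}\, n^{1/\alpha} \log n\, (1 + o(1)) ,
\end{aligned}$$

where the last step uses $\int_1^n \frac{\log x}{x^{1-1/\alpha}}\, dx = \alpha\, n^{1/\alpha} \log n\, (1 + o(1))$, which follows from (5).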

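We may now complete the proof of Theorem 8.

Proof: Remember that $\mathcal{X}_n = \{0, \ldots, K_n\}$. If $p$ is a probability mass function over alphabet $\mathcal{X}$, let $p^{\otimes n}$ be the probability mass function over $\mathcal{X}^n$ defined by $p^{\otimes n}(\mathbf{x}) = \prod_{i=1}^n p(x_i)$. Note that for every string $\mathbf{x} \in \mathbb{N}_+^n$,

$$\max_{p \in \mathfrak{M}_1(\mathcal{X}_n)} p^{\otimes n}(\mathbf{y}) \ge \max_{p \in \mathfrak{M}_1(\mathbb{N}_+)} p^{\otimes n}(\mathbf{x}) \ge \max_{P \in \Lambda_{C\cdot^{-\alpha}}} P^n(\mathbf{x}) = \hat{p}(\mathbf{x}) .$$

Together with Lemma 2 and the bounds on the redundancy of the Krichevsky-Trofimov mixture (see Krichevsky and Trofimov, 1981), this implies:

$$|\mathrm{C2}(\mathbf{x})| \le -\log \hat{p}(\mathbf{x}) + \frac{K_n}{2} \log n + O(1) .$$

Let $L(\mathbf{x})$ be the length of the code produced by algorithm CensoringCode on the input string $\mathbf{x}$, then

$$\begin{aligned}
\sup_{P \in \Lambda_{C\cdot^{-\alpha}}} E_P\left[L(X_{1:n}) - \log 1/P^n(X_{1:n})\right]
&\le \sup_{P} E_P\left[L(X_{1:n}) - \log 1/\hat{p}(X_{1:n})\right]\\
&\le \sup_{P} E_P\left[|\mathrm{C2}(X_{1:n})| + \log \hat{p}(X_{1:n}) + |\mathrm{C1}(X_{1:n})|\right]\\
&\le \sup_{\mathbf{x}} \left(|\mathrm{C2}(\mathbf{x})| + \log \hat{p}(\mathbf{x})\right) + \sup_{P} E_P\left[|\mathrm{C1}(X_{1:n})|\right]\\
&\le \left(\frac{\lambda n^{1/\alpha}}{2} + \frac{2C}{(\alpha-1) \lambda^{\alpha-1}}\, n^{1/\alpha}\right) \log n\, (1 + o(1)) .
\end{aligned}$$

The optimal value is $\lambda = \left(\frac{4C}{\alpha-1}\right)^{1/\alpha}$, for which we get:

$$\sup_{P \in \Lambda_{C\cdot^{-\alpha}}} E_P\left[L(X_{1:n}) - \log 1/P^n(X_{1:n})\right] \le \left(\frac{4Cn}{\alpha-1}\right)^{1/\alpha} \log n\, (1 + o(1)) .$$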
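To see where the constant of Theorem 8 comes from, one can plug the stated value of $\lambda$ into the factor that multiplies $n^{1/\alpha} \log n$ in the last bound, namely $\frac{\lambda}{2} + \frac{2C}{(\alpha-1)\lambda^{\alpha-1}}$ (this is our reading of the preceding display; the verification itself is elementary):

```latex
\begin{align*}
\lambda^{\alpha} = \frac{4C}{\alpha-1}
\quad\Longrightarrow\quad
\frac{2C}{(\alpha-1)\lambda^{\alpha-1}}
 &= \lambda\cdot\frac{2C}{(\alpha-1)\lambda^{\alpha}}
  = \lambda\cdot\frac{2C}{\alpha-1}\cdot\frac{\alpha-1}{4C}
  = \frac{\lambda}{2},\\
\frac{\lambda}{2}+\frac{2C}{(\alpha-1)\lambda^{\alpha-1}}
 &= \lambda = \Bigl(\frac{4C}{\alpha-1}\Bigr)^{1/\alpha}.
\end{align*}
```

With this choice the contribution of the arithmetic code (C2) and that of the Elias code (C1) are balanced, each accounting for half of the bound.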
VI. Adaptive algorithms

The performance of CensoringCode depends on the fit of the cutoff sequence to the tail behavior of the envelope. From the proof of Theorem 8, it should be clear that if CensoringCode is fed with a source whose marginal is light-tailed, it will be unable to take advantage of this, and will suffer from excessive redundancy.

In this section, a sequence $(Q^n)_n$ of coding probabilities is said to be approximately asymptotically adaptive with respect to a collection $(\Lambda_m)_{m \in \mathcal{M}}$ of source classes if for each $P \in \bigcup_{m \in \mathcal{M}} \Lambda_m$, for each $\Lambda_m$ such that $P \in \Lambda_m$:

$$D(P^n, Q^n) / R^+(\Lambda_m^n) \in O(\log n) .$$

Such a definition makes sense, since we are considering massive source classes whose minimax redundancies are large but still $o\left(\frac{n}{\log n}\right)$. If each class $\Lambda_m$ admits a non-trivial redundancy rate such that $R^+(\Lambda_m^n) = o\left(\frac{n}{\log n}\right)$, the existence of an approximately asymptotically adaptive sequence of coding probabilities means that $\bigcup_m \Lambda_m$ is feebly universal (see the Introduction for a definition).



A. Pattern coding 

First, the use of pattern coding (Orlitsky et al., 2004; Shamir, 2006) leads to an almost minimax adaptive procedure for small values of $\alpha$, that is heavy-tailed distributions. Let us introduce the notion of pattern using the example of string $\mathbf{x}$ = "abracadabra", which is made of $n = 11$ characters. The information it conveys can be separated in two blocks:

1) a dictionary $\Delta = \Delta(\mathbf{x})$: the sequence of distinct symbols occurring in $\mathbf{x}$ in order of appearance (in the example, $\Delta = (a, b, r, c, d)$).

2) a pattern $\psi = \psi(\mathbf{x})$ where $\psi_i$ is the rank of $x_i$ in the dictionary $\Delta$ (here, $\psi = 12314151231$).
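The decomposition into a dictionary and a pattern can be made concrete with a few lines of code. The sketch below is ours (the function name `dictionary_and_pattern` does not come from the paper) and simply follows the two definitions above.

```python
def dictionary_and_pattern(x):
    """Split a sequence into its dictionary and its pattern.

    The dictionary lists the distinct symbols in order of first appearance;
    the pattern replaces each symbol by its (1-based) rank in the dictionary.
    """
    rank = {}
    dictionary = []
    pattern = []
    for symbol in x:
        if symbol not in rank:
            dictionary.append(symbol)
            rank[symbol] = len(dictionary)
        pattern.append(rank[symbol])
    return dictionary, pattern

if __name__ == "__main__":
    dictionary, pattern = dictionary_and_pattern("abracadabra")
    print(dictionary)  # ['a', 'b', 'r', 'c', 'd']
    print(pattern)     # [1, 2, 3, 1, 4, 1, 5, 1, 2, 3, 1]
```

The original string is recovered by looking up each rank in the dictionary, so transmitting the pair (dictionary, pattern) loses no information.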

Now, consider the algorithm coding message $\mathbf{x}$ by transmitting successively:
1) the dictionary $\Delta_n = \Delta(\mathbf{x})$ (by concatenating the Elias codes for successive symbols);

2) and the pattern $\Psi_n = \psi(\mathbf{x})$, using a minimax procedure for coding patterns as suggested by Orlitsky et al. (2004) or Shamir (2006). Henceforth, the latter procedure is called pattern coding.



Theorem 9: Let $Q^n$ denote the coding probability associated with the coding algorithm which consists in applying Elias penultimate coding to the dictionary $\Delta(\mathbf{x})$ of a string $\mathbf{x}$ from $\mathbb{N}_+^n$ and then pattern coding to the pattern $\psi(\mathbf{x})$. Then for any $\alpha$ such that $1 < \alpha \le 5/2$, there exists a constant $K$ depending on $\alpha$ and $C$ such that

$$R^+\left(Q^n, \Lambda^n_{C\cdot^{-\alpha}}\right) \le K n^{1/\alpha} \log n .$$

Proof: For a given value of $C$ and $\alpha$, the Elias encoding of the dictionary uses on average

$$E\left[|\Delta_n|\right] = K' n^{1/\alpha} \log n$$

bits (as proved in Appendix IV), for some constant $K'$ depending on $\alpha$ and $C$.
If our pattern coder reaches (approximately) the minimax pattern redundancy

$$R^+_\Psi(\Psi_{1:n}) = \inf_{q \in \mathfrak{M}_1(\mathbb{N}_+^n)}\ \sup_{P \in \mathfrak{M}_1(\mathbb{N}_+)} E_P\left[\log \frac{P^{\otimes n}(\Psi_{1:n})}{q(\Psi_{1:n})}\right],$$

the encoding of the pattern uses on average

$$H(\Psi_{1:n}) + R^+_\Psi(\Psi_{1:n}) \le H(X_{1:n}) + R^+_\Psi(\Psi_{1:n}) \text{ bits.}$$

But in Orlitsky et al. (2004), the authors show that $R^+_\Psi(\Psi_{1:n})$ is upper-bounded by $O(\sqrt{n})$ and even $O\left(n^{2/5}\right)$ according to Shamir (2004) (actually, these bounds are even satisfied by the minimax individual pattern redundancy).

This remarkably simple method is however expected to have a poor performance when $\alpha$ is large. Indeed, it is proved in Garivier (2006) that $R^+_\Psi(\Psi_{1:n})$ is lower-bounded by $1.84 \left(\frac{n}{\log n}\right)^{1/3}$ (see also Shamir (2006) and references therein), which indicates that pattern coding is probably suboptimal as soon as $\alpha$ is larger than 3.



B. An approximately asymptotically adaptive censoring code 



Given the limited scope of the pattern coding method, we will attempt to turn the censoring code into an adaptive method, that is to tune the cutoff sequence so as to model the source statistics. As the cutoffs are chosen in such a way that they model the tail-heaviness of the source, we are facing a tail-heaviness estimation problem. In order to focus on the most important issues we do not attempt to develop a sequential algorithm. The $(n+1)$-th cutoff $K_{n+1}$ is chosen according to the number of distinct symbols $Z_n(\mathbf{x})$ in $\mathbf{x}$.

This is a reasonable method if the probability mass function defining the source statistics $P^1$ actually decays like $\frac{1}{k^\alpha}$. Unfortunately, sparse distributions consistent with $\Lambda_{\cdot^{-\alpha}}$ may lead this project astray. If, for example, $(Y_n)_n$ is a sequence of geometrically distributed random variables, and if $X_n = 2^{Y_n/\alpha}$, then the distribution of the $X_n$ just fits in $\Lambda_{\cdot^{-\alpha}}$ but obviously $Z_n(X_{1:n}) = Z_n(Y_{1:n}) = O(\log n)$.
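The logarithmic growth of the number of distinct symbols in the geometric example is easy to observe by simulation. The sketch below is ours; it assumes the particular geometric law $P(Y = y) = 2^{-y}$ (the text only says "geometrically distributed") and a fixed seed for reproducibility. Since $X_n$ is a one-to-one function of $Y_n$, counting distinct values of $Y$ is enough.

```python
import random

def sample_geometric(rng):
    """Number of fair-coin flips up to and including the first head: P(Y = y) = 2**-y."""
    y = 1
    while rng.random() < 0.5:
        y += 1
    return y

def distinct_symbols(seq):
    """Z_n: number of distinct symbols in the sequence."""
    return len(set(seq))

rng = random.Random(0)          # fixed seed, an arbitrary choice
n = 10_000
ys = [sample_geometric(rng) for _ in range(n)]
z_n = distinct_symbols(ys)

if __name__ == "__main__":
    print(n, z_n)  # z_n stays of the order of log2(n), far below n ** (1 / alpha)
```

A cutoff tuned on $Z_n$ would therefore be much too small for a class $\Lambda_{\cdot^{-\alpha}}$ in which such a source sits, which motivates the restriction to the subclasses introduced next.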

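Thus, rather than attempting to handle $\bigcup_{\alpha > 0} \Lambda_{\cdot^{-\alpha}}$, we focus on subclasses $\bigcup_{\alpha > 0} \mathcal{W}_\alpha$, where

$$\mathcal{W}_\alpha = \left\{P : P \in \Lambda_{\cdot^{-\alpha}},\ 0 < \liminf_k k^\alpha P^1(k) \le \limsup_k k^\alpha P^1(k) < \infty\right\} .$$

The rationale for tuning cutoff $K_n$ using $Z_n$ comes from the following two propositions.

Proposition 7: For every memoryless source $P \in \mathcal{W}_\alpha$, there exist constants $c_1$ and $c_2$ such that for all positive integers $n$,

$$c_1 n^{1/\alpha} \le E[Z_n] \le c_2 n^{1/\alpha} .$$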
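Proposition 7 can be illustrated numerically on the pure power-law source $p_k = 1/(\zeta(\alpha) k^\alpha)$, for which $E[Z_n] = \sum_k \left(1 - (1-p_k)^n\right)$. The sketch below is ours: it takes $\alpha = 2$ (so that $\zeta(2) = \pi^2/6$) and truncates the series at a hypothetical `k_max`, which slightly underestimates the expectation.

```python
import math

def expected_distinct(n, alpha, zeta_alpha, k_max=200_000):
    """Approximate E[Z_n] for p_k = 1 / (zeta(alpha) k^alpha) by a truncated sum."""
    total = 0.0
    for k in range(1, k_max + 1):
        p = 1.0 / (zeta_alpha * k ** alpha)
        # 1 - (1 - p)^n, computed stably for small p.
        total += -math.expm1(n * math.log1p(-p))
    return total

alpha = 2.0
zeta_2 = math.pi ** 2 / 6
ratios = [expected_distinct(n, alpha, zeta_2) / n ** (1 / alpha)
          for n in (100, 10_000)]

if __name__ == "__main__":
    print(ratios)  # both ratios E[Z_n] / n^(1/alpha) stay in a bounded range
```

The ratio $E[Z_n] / n^{1/\alpha}$ stays between two constants as $n$ grows by two orders of magnitude, which is the behavior the proposition asserts.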

Proposition 8: The number of distinct symbols $Z_n$ output by a memoryless source satisfies a Bernstein inequality:

$$P\left(Z_n \le \frac12 E[Z_n]\right) \le e^{-\frac18 E[Z_n]} . \qquad (6)$$

Proof: Note that $Z_n$ is a function of $n$ independent random variables. Moreover, $Z_n$ is a configuration function as defined by Talagrand (1995) since $Z_n(\mathbf{x})$ is the size of a maximum subsequence of $\mathbf{x}$ satisfying a hereditary property (all its symbols are pairwise distinct). Using the main theorem in Boucheron et al. (2000), this is enough to conclude.



Noting that $Z_n \ge 1$, we can derive the following inequality that will prove useful later on:

$$\begin{aligned}
E\left[\frac{1}{Z_n^{\alpha-1}}\right]
&= E\left[\frac{1}{Z_n^{\alpha-1}} \mathbb{1}_{Z_n > \frac12 E[Z_n]}\right] + E\left[\frac{1}{Z_n^{\alpha-1}} \mathbb{1}_{Z_n \le \frac12 E[Z_n]}\right]\\
&\le \frac{1}{\left(\frac12 E[Z_n]\right)^{\alpha-1}} + P\left(Z_n \le \frac12 E[Z_n]\right) . \qquad (7)
\end{aligned}$$

We consider here a modified version of CensoringCode that operates similarly, except that

1) the string $\mathbf{x}$ is first scanned completely to determine $Z_n(\mathbf{x})$;

2) the constant cutoff $\hat{K}_n = \mu Z_n$ is used for all symbols $x_i$, $1 \le i \le n$, where $\mu$ is some positive constant;

3) the value of $\hat{K}_n$ is encoded using Elias penultimate code and transmitted before C1 and C2.

Note that this version of the algorithm is not sequential because of the initial scanning.

Algorithm 2 AdaptiveCensoringCode

    cutoff ← μ Z_n(x) {Determination of the constant cutoff}
    counts ← [1/2, 1/2, . . .]
    for i from 1 to n do
        if x[i] ≤ cutoff then
            ArithCode(x[i], counts[0 : cutoff])
        else
            ArithCode(0, counts[0 : cutoff])
            C1 ← C1 · EliasCode(x[i])
            counts[0] ← counts[0] + 1
        end if
        counts[x[i]] ← counts[x[i]] + 1
    end for
    C2 ← ArithCode()
    C1 · C2
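As with Algorithm 1, the code length of the adaptive variant can be evaluated without an actual arithmetic coder. The sketch below is our own simplification: it returns the bits used to transmit the cutoff, the bits of the Elias part and the ideal length of the arithmetic-coding part. The function names, the default $\mu = 1$, the rounding of the cutoff to an integer and the lower limit of 1 on the cutoff are assumptions made for the illustration.

```python
import math

def elias_length(j):
    """Length l(j) = floor(log j + 2 log(1 + log j) + 1), binary logarithms."""
    lg = math.log2(j)
    return math.floor(lg + 2 * math.log2(1 + lg) + 1)

def adaptive_censoring_length(x, mu=1.0):
    """Return (header bits, C1 bits, ideal C2 bits) for a list of positive integers."""
    # First pass: the cutoff is driven by the number of distinct symbols Z_n.
    cutoff = max(1, math.floor(mu * len(set(x))))
    header_bits = elias_length(cutoff)
    counts = [0] * (cutoff + 1)   # counts[0] tracks the escape symbol
    c1_bits = 0
    c2_bits = 0.0
    # Second pass: Krichevsky-Trofimov coding over {0, ..., cutoff}.
    for i, sym in enumerate(x):
        coded = sym if sym <= cutoff else 0
        p = (counts[coded] + 0.5) / (i + (cutoff + 1) / 2)
        c2_bits -= math.log2(p)
        counts[coded] += 1
        if coded == 0:
            c1_bits += elias_length(sym)   # censored symbol, sent with Elias code
    return header_bits, c1_bits, c2_bits

if __name__ == "__main__":
    print(adaptive_censoring_length([1, 5, 1]))
```

Because the cutoff is constant over the whole string, no escape-counter correction is needed, in contrast with the sequential version.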



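We may now assert the following.

Theorem 10: The algorithm AdaptiveCensoringCode is approximately asymptotically adaptive with respect to $\bigcup_{\alpha > 0} \mathcal{W}_\alpha$.

Proof: Let us again denote by C1($\mathbf{x}$) and C2($\mathbf{x}$) the two parts of the code-string associated with $\mathbf{x}$.
Let $\hat{L}$ be the codelength of the output of algorithm AdaptiveCensoringCode.
For any source $P$:

$$\begin{aligned}
E_P\left[\hat{L}(X_{1:n})\right] - H(X_{1:n})
&= E_P\left[\ell(\hat{K}_n) + |\mathrm{C1}(X_{1:n})| + |\mathrm{C2}(X_{1:n})|\right] - n \sum_{k=1}^\infty P^1(k) \log \frac{1}{P^1(k)}\\
&\le E_P\left[\ell(\hat{K}_n)\right] + E_P\left[|\mathrm{C1}(X_{1:n})|\right] + E_P\left[|\mathrm{C2}(X_{1:n})| - n \sum_{k=1}^{\hat{K}_n} P^1(k) \log \frac{1}{P^1(k)}\right] .
\end{aligned}$$

As function $\ell$ is increasing and equivalent to $\log$ at infinity, the first summand is obviously $o\left(E_P[\hat{K}_n]\right)$.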
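Moreover, if $P \in \mathcal{W}_\alpha$ there exists $C$ such that $P^1(k) \le \frac{C}{k^\alpha}$ and the second summand satisfies:

$$\begin{aligned}
E_P\left[|\mathrm{C1}(X_{1:n})|\right]
&= n E_P\left[\sum_{k \ge \hat{K}_n + 1} P^1(k)\, \ell(k)\right]\\
&\le n C E_P\left[\int_{\hat{K}_n}^\infty \frac{\ell(x)}{x^\alpha}\, dx\right]\\
&= n C E_P\left[\frac{1}{\hat{K}_n^{\alpha-1}} \int_1^\infty \frac{\ell(\hat{K}_n u)}{u^\alpha}\, du\right]\\
&\le n C E_P\left[\frac{1}{\hat{K}_n^{\alpha-1}} \left(\int_1^\infty \frac{\log \hat{K}_n}{u^\alpha}\, du + \int_1^\infty \frac{\log(u)}{u^\alpha}\, du\right)\right] (1 + o(1))\\
&= O\left(n^{1/\alpha} \log n\right)
\end{aligned}$$

by Proposition 7 and Inequality (7).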

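By Theorem 2, every string $\mathbf{x} \in \mathbb{N}_+^n$ satisfies

$$|\mathrm{C2}(\mathbf{x})| - n \sum_{k=1}^{\hat{K}_n} P^1(k) \log \frac{1}{P^1(k)} \le \frac{\hat{K}_n}{2} \log n + 2 .$$

Hence, the third summand is upper-bounded as:

$$E_P\left[|\mathrm{C2}(X_{1:n})| - n \sum_{k=1}^{\hat{K}_n} P^1(k) \log \frac{1}{P^1(k)}\right] \le \frac{E_P[\hat{K}_n]}{2} \log n + 2 = O\left(n^{1/\alpha} \log n\right),$$

which completes the proof of the theorem.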
Appendix I
Upper-bound on minimax regret

This section contains the proof of the last inequality in Theorem 2.

The minimax regret is not larger than the maximum regret of the Krichevsky-Trofimov mixture over an $m$-ary alphabet over strings of length $n$. The latter is classically upper-bounded by

$$\log\left(\frac{\Gamma\left(n + \frac{m}{2}\right) \Gamma\left(\frac12\right)}{\Gamma\left(n + \frac12\right) \Gamma\left(\frac{m}{2}\right)}\right),$$

as proved for example in (Csiszár, 1990).

Now the Stirling approximation to the Gamma function (see Whittaker and Watson, 1996, Chapter XII) asserts that for any $x > 0$, there exists $\beta \in [0, 1]$ such that

$$\Gamma(x) = x^{x - \frac12} e^{-x} \sqrt{2\pi}\, e^{\frac{\beta}{12x}} .$$
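Hence,

$$\begin{aligned}
\log\left(\frac{\Gamma\left(n + \frac{m}{2}\right) \Gamma\left(\frac12\right)}{\Gamma\left(n + \frac12\right) \Gamma\left(\frac{m}{2}\right)}\right)
&= \left(n + \frac{m-1}{2}\right) \log\left(n + \frac{m}{2}\right) - n \log\left(n + \frac12\right) - \frac{m-1}{2} \log \frac{m}{2} && (8)\\
&\quad + \left(-\left(n + \frac{m}{2}\right) + \left(n + \frac12\right) + \frac{m}{2}\right) \log e && (9)\\
&\quad + \log\sqrt{2\pi} - \log\sqrt{2\pi} - \log\sqrt{2\pi} + \log\sqrt{\pi} && (10)\\
&\quad + \left(\frac{\beta_1}{12\left(n + \frac{m}{2}\right)} - \frac{\beta_2}{12\left(n + \frac12\right)} - \frac{\beta_3}{6m}\right) \log e && (11)
\end{aligned}$$

for some $\beta_1, \beta_2, \beta_3 \in [0, 1]$. Now, (9)+(10)+(11) is smaller than $\frac{\log e}{2} - \log\sqrt{2} + \frac{\log e}{12\left(n + \frac{m}{2}\right)} \le 2$, and (8) equals:

$$\frac{m-1}{2} \log n + \left(n \log\frac{n + \frac{m}{2}}{n + \frac12} - \frac{m-1}{2} \log e\right) + \left(\frac{m-1}{2} \log\frac{n + \frac{m}{2}}{\frac{m}{2}} + \frac{m-1}{2} \log e - \frac{m-1}{2} \log n\right) .$$

But

$$n \log\frac{n + \frac{m}{2}}{n + \frac12} = n \log\left(1 + \frac{\frac{m-1}{2}}{n + \frac12}\right) \le n\, \frac{\frac{m-1}{2}}{n + \frac12} \log e \le \frac{m-1}{2} \log e ,$$

and

$$\frac{m-1}{2} \log\frac{n + \frac{m}{2}}{\frac{m}{2}} + \frac{m-1}{2} \log e - \frac{m-1}{2} \log n = \frac{m-1}{2} \log\frac{2\left(n + \frac{m}{2}\right) e}{nm} \le 0$$

if $\left(n + \frac{m}{2}\right) e \le \frac{nm}{2}$, that is $\frac{2}{m} + \frac{1}{n} \le \frac{1}{e}$, which is satisfied as soon as $m$ and $n$ are both at least equal to 9. For the smaller values of $m, n \in \{2, \ldots, 8\}$ the result can be checked directly.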



Appendix II 

Lower bound on redundancy for power-law envelopes 

In this appendix we derive a lower-bound for power-law envelopes using Theorem 5. Let $\alpha$ denote a real larger than 1. Let $C$ be such that $C^{1/\alpha} > 4$. As the envelope function is defined by $f(i) = 1 \wedge C/i^\alpha$, the constant $c(\infty) = \sum_{i \ge 1} f(2i)$ satisfies

1 < c(oo) < V 



a-1 2 - v/ ~2 (a - 1)2" \ 2 

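The condition on $C$ and $\alpha$ warrants that, for sufficiently large $p$, we have $c(p) > 1$ (this is indeed true for $p > C^{1/\alpha}$). We choose $p = a n^{1/\alpha}$ for $a$ small enough to have

$$\frac{(1-\lambda) C \epsilon}{(2a)^\alpha c(\infty)} > 10 ,$$

so that condition $(1-\lambda)\, n\, \frac{f(2p)}{c(p)} > \frac{10}{\epsilon}$ is satisfied for $n$ large enough. Then

$$R^+(\Lambda_f^n) \ge C(p, n, \lambda, \epsilon) \sum_{i=1}^p \left(\frac12 \log \frac{n (1-\lambda) \pi f(2i)}{2 c(p) e} - \epsilon\right),$$

where $C(p, n, \lambda, \epsilon) \ge \frac{1}{1 + \frac{(2a)^\alpha c(\infty)}{\lambda^2 C}} \left(1 - \frac{4}{\pi} \sqrt{\frac{5 (2a)^\alpha c(\infty)}{(1-\lambda)\, \epsilon\, C}}\right)$, and

$$\begin{aligned}
\sum_{i=1}^p \left(\frac12 \log \frac{n (1-\lambda) \pi f(2i)}{2 c(p) e} - \epsilon\right)
&= \frac{p}{2} \log n - \frac{\alpha}{2} \left(p \log p - p + o(p)\right) + \left(\frac12 \log \frac{(1-\lambda) \pi C}{2^{1+\alpha} c(\infty) e} - \epsilon\right) p\\
&= \frac{a n^{1/\alpha}}{2} \log n - \frac{\alpha}{2} \left(a n^{1/\alpha} \log a + \frac{a}{\alpha} n^{1/\alpha} \log n - a n^{1/\alpha} + o\left(n^{1/\alpha}\right)\right)\\
&\qquad + \left(\frac12 \log \frac{(1-\lambda) \pi C}{2^{1+\alpha} c(\infty) e} - \epsilon\right) a n^{1/\alpha}\\
&= \left(\frac{\alpha}{2} (1 - \log a) + \frac12 \log \frac{(1-\lambda) \pi C}{2^{1+\alpha} c(\infty) e} - \epsilon + o(1)\right) a n^{1/\alpha} .
\end{aligned}$$

For $a$ small enough, this gives the existence of a positive constant $\eta$ such that $R^+(\Lambda_f^n) \ge \eta n^{1/\alpha}$.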



Appendix III 
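Proof of Proposition 7

Suppose that there exist $k_0$, $c$ and $C$ such that for all $k \ge k_0$, $\frac{c}{k^\alpha} \le p_k \le \frac{C}{k^\alpha}$.
For $0 \le x \le \frac12$, it holds that $-(2 \log 2) x \le \log(1 - x) \le -x$ and thus

$$e^{-(2 \log 2) n x} \le (1 - x)^n \le e^{-n x} .$$

Hence (as $p_k \le \frac12$ for all $k \ge 2$):

$$\sum_{k = k_0}^\infty \left(1 - e^{-\frac{nc}{k^\alpha}}\right) \le E[Z_n] \le 1 + \sum_{k=2}^\infty \left(1 - e^{-\frac{(2 \log 2) C n}{k^\alpha}}\right),$$

$$\int_{k_0}^\infty \left(1 - e^{-\frac{nc}{x^\alpha}}\right) dx \le E[Z_n] \le 1 + \int_1^\infty \left(1 - e^{-\frac{(2 \log 2) C n}{x^\alpha}}\right) dx .$$

But, for any $t, K > 0$, it holds that

$$\int_K^\infty \left(1 - e^{-\frac{t}{x^\alpha}}\right) dx = \frac{t^{1/\alpha}}{\alpha} \int_0^{t/K^\alpha} \frac{1 - e^{-u}}{u^{1 + 1/\alpha}}\, du .$$

Thus, by noting that the integral

$$A(\alpha) = \int_0^\infty \frac{1 - e^{-u}}{u^{1 + 1/\alpha}}\, du$$

is finite, we get

$$\frac{A(\alpha)}{\alpha}\, c^{1/\alpha}\, n^{1/\alpha} (1 - o(1)) \le E[Z_n] \le 1 + \frac{A(\alpha)}{\alpha} \left((2 \log 2) C n\right)^{1/\alpha} .$$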
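Appendix IV
Expected size of dictionary encoding

Assume that the probability mass function $(p_k)$ satisfies $p_k \le \frac{C}{k^\alpha}$ for $C > 0$ and all $k > 0$. Then, using Elias penultimate code for the first occurrence of each symbol in $X_{1:n}$, the expected length of the binary encoding of the dictionary can be upper-bounded in the following way. Let $U_k$ be equal to 1 if symbol $k$ occurs in $X_{1:n}$, and equal to 0 otherwise.

$$\begin{aligned}
E\left[|\Delta_n|\right]
&= E\left[\sum_{k=1}^\infty U_k\, \ell(k)\right]
= \sum_{k=1}^\infty E\left[U_k\right] \ell(k)\\
&\le 2 \left(1 + \sum_{k=2}^\infty \left(1 - e^{-\frac{(2 \log 2) C n}{k^\alpha}}\right) \log k\right)\\
&\le 2 \left(1 + \int_1^\infty \left(1 - e^{-\frac{(2 \log 2) C n}{x^\alpha}}\right) \log x\, dx\right)\\
&\le 2 \left(1 + \frac{\left((2 \log 2) C n\right)^{1/\alpha}}{\alpha} \int_0^{(2 \log 2) C n} \frac{1 - e^{-u}}{u^{1 + 1/\alpha}}\, \frac{1}{\alpha} \log\frac{(2 \log 2) C n}{u}\, du\right)\\
&\le T \left((2 \log 2) C n\right)^{1/\alpha} \log n \int_0^\infty \frac{1 - e^{-u}}{u^{1 + 1/\alpha}}\, du
\end{aligned}$$

for some positive constant $T$.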
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